Private preview · Apache 2.0

The execution-aware AI gateway for Kubernetes.

Self-hosted. OpenAI-compatible. Optimizes how tokens execute on your GPUs. Tells you who owes the bill.

Change one line.

The problem

LLM inference on Kubernetes is leaking money in three places.

60%

GPUs sit idle.

Even on top inference platforms, GPU utilization rarely cracks 40% in production. KV-cache fragmentation, dumb scheduling, and over-provisioning quietly burn the budget.

$?

Nobody knows who owes the bill.

Engineering, product, and finance argue every quarter. Generic FinOps tools don't understand the workload-level details that actually drive LLM cost — so the chargeback report never gets trusted.

Prompts leak to a third party.

SaaS gateways stall at procurement for any team handling PII, financial data, healthcare records, or proprietary code. Without a self-hosted option, there is no path forward.

What we do

One install. Three load-bearing pillars.

01
Smart Gateway
On the hot path, by design.

A drop-in OpenAI-compatible gateway that lives inside your cluster, in front of your inference workloads. Streaming responses pass through byte-for-byte. Per-tenant policy, intelligent routing, and full request-level visibility — without your prompts ever leaving your network.

  • OpenAI-compatible — change a base URL, nothing else
  • Per-tenant entitlements, rate limits, and quotas
  • First-party visibility into every request
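The "change a base URL, nothing else" claim can be sketched as follows: the request body stays OpenAI-shaped, and only the URL it is sent to moves inside the cluster. The service DNS name below is an illustrative assumption, not the actual install's address.

```python
import json

# Hypothetical in-cluster address — substitute your gateway Service's DNS name.
BASE_URL = "http://kubeburner-gateway.kubeburner.svc.cluster.local/v1"

def chat_completion_request(model: str, messages: list) -> tuple:
    """Build a standard OpenAI-style chat payload; only the URL changes."""
    url = f"{BASE_URL}/chat/completions"
    body = json.dumps({"model": model, "messages": messages})
    return url, body

url, body = chat_completion_request(
    "llama-3-8b", [{"role": "user", "content": "Hello"}]
)
print(url)  # same path an OpenAI client would hit, now inside your network
```

Any OpenAI-compatible SDK can produce the same request by overriding its base URL; application code above that line is untouched.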
02
FinOps for LLMs
Numbers you can defend in budget review.

Generic Kubernetes FinOps tools treat an LLM pod as a generic pod. We don't. We report exact cost-per-token by workload, model, and team — broken down the way LLM inference actually behaves. Daily chargeback rollups with CSV export, ready for finance.

  • Cost-per-token by model · team · workload
  • Per-team chargeback with CSV export
  • Idle GPU waste tracked separately, never folded in
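The shape of such a chargeback rollup can be sketched with a few lines of stdlib Python. The record fields, GPU rate, and numbers below are illustrative assumptions, not KubeBurner's schema or pricing.

```python
import csv, io
from collections import defaultdict

# Assumed rate for illustration: a GPU billed at $2.50/hour.
GPU_DOLLARS_PER_SECOND = 2.50 / 3600

# Hypothetical per-request records as a gateway on the request path would see them.
requests = [
    {"team": "search",  "model": "llama-3-8b", "tokens": 120_000, "gpu_seconds": 90.0},
    {"team": "search",  "model": "llama-3-8b", "tokens":  60_000, "gpu_seconds": 40.0},
    {"team": "support", "model": "llama-3-8b", "tokens":  30_000, "gpu_seconds": 30.0},
]

# Aggregate cost and token counts per (team, model) pair.
rollup = defaultdict(lambda: {"tokens": 0, "cost": 0.0})
for r in requests:
    key = (r["team"], r["model"])
    rollup[key]["tokens"] += r["tokens"]
    rollup[key]["cost"] += r["gpu_seconds"] * GPU_DOLLARS_PER_SECOND

# Emit a finance-ready CSV rollup.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["team", "model", "tokens", "cost_usd", "cost_per_1k_tokens"])
for (team, model), agg in sorted(rollup.items()):
    writer.writerow([team, model, agg["tokens"],
                     round(agg["cost"], 4),
                     round(agg["cost"] / agg["tokens"] * 1000, 6)])
print(buf.getvalue())
```

The point of the sketch is attribution, not arithmetic: each dollar is tied to a team and a model, so the rollup can be defended line by line.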
03
Recommendations + Apply
Actionable, not advisory.

Concrete, dollar-quantified recommendations for cutting waste — not a wall of dashboards waiting for someone to act. Preview the change, apply with one click, roll back on the same screen. Every change is snapshotted, every action is logged. Always human-in-the-loop.

  • Dollar-range savings with confidence notes
  • One-click preview · apply · rollback
  • Snapshot and audit log on every change
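The preview → apply → rollback loop above follows a standard snapshot-and-audit pattern, sketched below in plain Python. The config shape and class names are made-up illustrations, not KubeBurner's API.

```python
import copy
from datetime import datetime, timezone

class ReviewedChange:
    """Minimal sketch: snapshot before every apply, log every action."""

    def __init__(self, config: dict):
        self.config = config
        self._snapshots = []   # pre-change copies, newest last
        self.audit_log = []    # (timestamp, actor, action) per change

    def _log(self, actor: str, action: str):
        self.audit_log.append(
            (datetime.now(timezone.utc).isoformat(), actor, action)
        )

    def preview(self, patch: dict) -> dict:
        """Show the resulting config without touching the live one."""
        merged = copy.deepcopy(self.config)
        merged.update(patch)
        return merged

    def apply(self, actor: str, patch: dict):
        """Snapshot first, then apply — only after a human accepts."""
        self._snapshots.append(copy.deepcopy(self.config))
        self.config.update(patch)
        self._log(actor, f"apply {patch}")

    def rollback(self, actor: str):
        """Restore the most recent pre-change snapshot."""
        self.config = self._snapshots.pop()
        self._log(actor, "rollback")

change = ReviewedChange({"replicas": 8, "gpu_per_pod": 1})
assert change.preview({"replicas": 5})["replicas"] == 5  # live config untouched
change.apply("alice", {"replicas": 5})
change.rollback("alice")
assert change.config["replicas"] == 8  # back where we started, with an audit trail
```

Because the snapshot is taken before the patch lands, rollback is a restore, not a guess — which is what makes the one-click loop safe.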

Why us

For platform teams who need an AI gateway and cannot send prompts to a SaaS, KubeBurner is the only gateway that is also execution-aware on the GPUs underneath.

[Diagram: control plane vs. execution plane. A request arrow descends through the narrower "control plane" layer (choose model, send request) into the wider "execution plane" below, landing in a highlighted GPU cell (choose GPU, reuse cache, schedule tokens, optimize execution).]
Other AI gateways

Operate at the control plane: pick a model, send the request, log the result. Useful, but they leave the harder problem — what actually happens on your GPUs — to someone else.

KubeBurner

Operates one level deeper: which GPU the request lands on, whether the cache is reused, how the workload is balanced, whether resources are fairly shared. The level where cost and latency are actually decided.

Approach

Four principles. No surprises.

01

Self-hosted, end to end.

Runs entirely inside your Kubernetes cluster. Your prompts, your data, your billing details — none of it leaves your network.

02

Drop-in compatible.

OpenAI-compatible from day one. Point your applications at our gateway and the rest of your code continues to work.

03

First-party measurement.

We sit on the request path, so the numbers are what actually happened — not estimates, not inferred from indirect metrics.

04

Always human-in-the-loop.

Every recommendation is a suggestion until you accept it. Preview, apply, roll back. Nothing is auto-changed in your cluster.

Where we are

Private preview. Working with a small group of design partners.

What design partners get
  • Private install in your own cluster
  • Direct line to the team
  • Influence over the roadmap
  • Apache 2.0 — your install is yours, forever
Get in touch

We're a fit if you're running production LLM inference on Kubernetes, care about cost and latency, and want to keep your prompts in your own cluster.

Send a note to hello@kubeburner.ai and we'll set up a short call.