Prefix caching runs inside the secure enclave as a latency optimization. What Tinfoil does not expose is a customer-facing prompt cache, meaning there is no cache-hit reporting and no cached-token discount. The reason is specific to confidential inference: the cheap, durable on-disk caching that makes those discounts work elsewhere depends on persistent key management that does not yet fit cleanly inside an attested, ephemeral enclave.

Two ways people use “prompt caching”

The same phrase covers two different things:
  1. Engine-level prefix caching. The serving engine reuses already-computed key/value (KV) blocks when a later request shares a token prefix. In vLLM this is on by default, lives in GPU memory, and is ephemeral: cached blocks sit in the normal KV pool until the allocator reclaims them. It skips prefill work for the shared prefix only; it does not reduce the cost of generating new tokens.
  2. Product-level prompt caching. The API reports cache hits, bills cached tokens at a discount, and promises that a prefix stays cached long enough to rely on. To make that cheap and dependable, providers persist the cache so it outlives a single request and can be billed against. That durability is the part a confidential deployment has to treat carefully.
Tinfoil runs the first where it helps. It does not expose the second today.
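To make the engine-level mechanism concrete, here is a minimal sketch of prefix caching keyed by chained block hashes with least-recently-used eviction. It is a toy model written for this article, not vLLM's implementation: `BLOCK_SIZE`, `block_hashes`, and `PrefixCache` are hypothetical names, and the cache stores placeholders rather than real KV tensors.

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 4  # tokens per KV block; hypothetical, kept small for the demo


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with its parent's hash, so a block's
    identity depends on the entire prefix that precedes it."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore a partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        data = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(data).digest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy LRU set of block hashes standing in for resident KV blocks."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block hash -> placeholder for KV data

    def lookup(self, token_ids):
        """Return how many leading tokens already have cached blocks."""
        hit_blocks = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                break  # the first miss ends the reusable prefix
            self.blocks.move_to_end(h)  # mark as recently used
            hit_blocks += 1
        return hit_blocks * BLOCK_SIZE

    def insert(self, token_ids):
        """Record the blocks of a completed prefill, evicting the oldest."""
        for h in block_hashes(token_ids):
            self.blocks[h] = True
            self.blocks.move_to_end(h)
            while len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)


cache = PrefixCache(capacity_blocks=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]

cache.insert(system_prompt + [9, 10, 11, 12])
# A second request shares the 8-token system prompt but diverges afterwards.
hit_warm = cache.lookup(system_prompt + [20, 21, 22, 23])  # 8 tokens reused

# Unrelated traffic fills the pool and pushes the shared prefix out.
cache.insert(list(range(100, 116)))
hit_after_eviction = cache.lookup(system_prompt + [20, 21, 22, 23])  # 0 tokens
```

The second lookup returning zero is the ephemeral behavior described above: the prefix is reusable only while its blocks are still resident.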

How providers make caching pay

A cache discount is only worth offering if cache hits are frequent and cached entries are long-lived. In GPU memory, KV blocks are evicted as soon as live requests need the space, so providers move them off GPU memory and use disk-backed persistence when they need durable caches (OpenRouter describes this). Caches are typically scoped per organization, so a tenant reuses its own prefixes rather than sharing across tenants. KV tensors were long assumed hard to invert back into text, but recent research (USENIX Security 2025, cited by OpenRouter) challenges that, so KV paged out of GPU memory is data worth protecting.
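One way to picture per-organization scoping is to mix a tenant identifier into the cache key, so identical prefixes from different tenants never collide. The sketch below is illustrative only; `scoped_prefix_key` and the organization IDs are hypothetical and do not describe any particular provider's scheme.

```python
import hashlib


def scoped_prefix_key(org_id, token_ids):
    """Derive a cache key from the tenant ID plus the token prefix."""
    h = hashlib.sha256()
    h.update(org_id.encode() + b"\x00")  # separator keeps the ID unambiguous
    h.update(",".join(map(str, token_ids)).encode())
    return h.hexdigest()


prefix = [1, 2, 3, 4]
same_tenant_reuses = scoped_prefix_key("org-a", prefix) == scoped_prefix_key("org-a", prefix)
tenants_isolated = scoped_prefix_key("org-a", prefix) != scoped_prefix_key("org-b", prefix)
```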

Caching and confidential inference

Tinfoil could build the same thing with vLLM: LMCache offloads KV to CPU DRAM, local disk, or a remote store and reuses it across instances. But the cheap, durable on-disk persistence that makes such a cache dependable sits in tension with the secure-enclave guarantee. In a confidential-computing deployment the GPU runs alongside a CPU confidential VM, so the encrypted VM memory and the CPU-GPU transfers are inside the trust boundary. KV held there stays protected, and it could even be shared with another attested enclave that keeps caches in memory. But memory is volatile and limited; the cheap way to keep a cache durable is to write it to disk. A single live enclave can encrypt its own cache to disk and read it back as long as it keeps the key. The complexity arises once several enclaves serve the same model and must read the same disk-backed cache: without an attested key-coordination scheme, the alternatives are a key handed to the client (which breaks API compatibility) or an escrow service outside the enclave (which adds a great deal of complexity to the trusted code base).
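The key-management problem can be shown with a toy model. In the sketch below, each enclave holds an ephemeral key that exists only in memory; it can seal a cache blob and read it back, but a second enclave, or the same enclave after a restart, cannot. The `Enclave` class is hypothetical, and the cipher is a deliberately simple stand-in built from the standard library (a SHAKE-256 keystream plus an HMAC tag). A real deployment would use a vetted AEAD construction and hardware-backed sealing, so treat this only as an illustration of who can read what.

```python
import hashlib
import hmac
import secrets


class Enclave:
    """Toy enclave whose cache key lives only in memory."""

    def __init__(self):
        self._key = secrets.token_bytes(32)  # gone when the enclave restarts

    def _keystream(self, nonce, length):
        # Toy keystream; a stand-in for a real authenticated cipher.
        return hashlib.shake_256(b"enc" + self._key + nonce).digest(length)

    def _tag(self, nonce, ciphertext):
        return hmac.new(self._key, b"mac" + nonce + ciphertext, hashlib.sha256).digest()

    def seal(self, kv_bytes):
        """Encrypt and authenticate a cache blob before writing it to disk."""
        nonce = secrets.token_bytes(16)
        stream = self._keystream(nonce, len(kv_bytes))
        ciphertext = bytes(a ^ b for a, b in zip(kv_bytes, stream))
        return nonce + ciphertext + self._tag(nonce, ciphertext)

    def unseal(self, blob):
        """Verify and decrypt a blob; fails unless this enclave's key sealed it."""
        nonce, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
        if not hmac.compare_digest(tag, self._tag(nonce, ciphertext)):
            raise ValueError("blob was not sealed with this enclave's key")
        stream = self._keystream(nonce, len(ciphertext))
        return bytes(a ^ b for a, b in zip(ciphertext, stream))


enclave_a = Enclave()
enclave_b = Enclave()  # a second replica serving the same model

blob = enclave_a.seal(b"kv-block-bytes")
own_read = enclave_a.unseal(blob)  # the sealing enclave can read it back

try:
    enclave_b.unseal(blob)
    peer_can_read = True
except ValueError:
    peer_can_read = False  # without a shared key, the peer cannot
```

Making `peer_can_read` true is exactly the coordination step discussed above: the replicas need the same key, and that key has to reach them without leaving the attested boundary.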

What Tinfoil does today

  • Engine-level prefix caching runs inside the enclave where it speeds up repeated prefixes while their blocks remain resident.
  • KV cache stays in enclave memory. Tinfoil does not write it to disk for durable reuse.
  • Tinfoil does not steer requests to keep a cache warm, and does not report cache hits or discount cached tokens.
The effect is that request payloads and derived KV stay inside the live confidential-computing trust boundary. The trade-off is that Tinfoil does not pass through a cached-token discount.

Why in-GPU caching isn’t enough

Even setting confidentiality aside, keeping the cache in GPU memory limits how much it can do. The cache shares the same HBM as live traffic, and that space is tight: model weights already consume most of a node’s memory, leaving a KV pool that the scheduler prioritizes for in-flight requests. On a typical node serving a large model, that pool is a few hundred gigabytes shared across every concurrent session, and a single long-context request can occupy tens of gigabytes on its own. A cached prefix lives in that same pool, so it survives only until live requests need the blocks back. It helps when a tenant repeats a long prefix while those blocks are still resident, but hit rates depend on load and traffic patterns. That makes in-GPU caching a useful latency optimization, not a dependable store you can build a billing guarantee on, which is why inference providers opt for off-GPU storage to back a cache product.
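A rough calculation shows why a single long-context request can reach tens of gigabytes. The per-token KV size follows from the model shape: two tensors (key and value) per layer, per KV head, per head dimension, times the bytes per element. The model dimensions and the 300 GiB pool below are hypothetical values chosen for illustration, not a description of any model Tinfoil serves.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_gib(num_tokens, **model):
    """KV cache footprint of a sequence, in GiB."""
    return num_tokens * kv_bytes_per_token(**model) / 2**30


# Hypothetical large model: 80 layers, 8 KV heads, head dimension 128, 16-bit KV.
model = dict(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2)

per_token = kv_bytes_per_token(**model)    # 327,680 bytes, i.e. 320 KiB
request_gib = kv_gib(128 * 1024, **model)  # 40.0 GiB for a 128K-token context

pool_gib = 300  # assumed KV pool size, for illustration only
max_long_requests = int(pool_gib // request_gib)  # 7
```

Under these assumptions, seven full-length requests would fill the pool, which leaves little room for idle cached prefixes once traffic picks up.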

When this could change

Two developments would change this. More HBM per node (for example, Vera Rubin-class systems) widens the in-enclave KV pool, so prefix caching stays resident longer and hit rates improve without writing to disk at all. Separately, attested enclaves can coordinate a shared cache-encryption key among themselves and seal it locally, so a durable on-disk cache is readable across a fleet without handing keys to the client or to an escrow service outside the enclave. That second path is an engineering-complexity decision rather than a missing hardware capability, and we are evaluating it with a view to revisiting durable caching and cache-based pricing in the future.