Two ways people use “prompt caching”
The same phrase covers two different things:
- Engine-level prefix caching. The serving engine reuses already-computed key/value (KV) blocks when a later request shares a token prefix. In vLLM this is on by default, lives in GPU memory, and is ephemeral: cached blocks sit in the normal KV pool until the allocator reclaims them. It skips prefill work for the shared prefix only; it does not reduce the cost of generating new tokens.
- Product-level prompt caching. The API reports cache hits, bills cached tokens at a discount, and promises that a prefix stays cached long enough to rely on. To make that cheap and dependable, providers persist the cache so it outlives a single request and can be billed against. That durability is the part a confidential deployment has to treat carefully.
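To make the engine-level case concrete, here is a minimal sketch of prefix matching over fixed-size token blocks. It assumes a hash-chained scheme in which each block's key depends on all tokens before it. `PrefixCache`, `block_hashes`, and `BLOCK_SIZE` are hypothetical names for illustration, not vLLM's API, and the stored value is a placeholder rather than real KV tensors.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per block; illustrative only, real engines use larger blocks


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash every full block together with the hash of the blocks before it."""
    hashes: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy stand-in for an engine's in-memory KV block pool (hypothetical)."""

    def __init__(self) -> None:
        self.blocks: Dict[str, str] = {}  # chained hash -> placeholder for KV data

    def prefill(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens reused from cache, tokens that still needed prefill)."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self.blocks:
                break  # the first miss ends the shared prefix
            hit_blocks += 1
        for h in hashes[hit_blocks:]:
            self.blocks[h] = "kv"  # pretend we computed and stored this block
        cached = hit_blocks * BLOCK_SIZE
        return cached, len(tokens) - cached

    def evict_all(self) -> None:
        """Model the allocator reclaiming every cached block."""
        self.blocks.clear()


cache = PrefixCache()
system_prompt = list(range(8))                      # shared 8-token prefix
first = cache.prefill(system_prompt + [100, 101, 102])
second = cache.prefill(system_prompt + [200, 201])  # reuses the two prefix blocks
cache.evict_all()
third = cache.prefill(system_prompt + [200, 201])   # cold again after eviction
```

In this sketch the second request skips prefill for the eight shared tokens, and the third pays full prefill again after eviction. That is the ephemeral behavior described above: nothing guarantees a hit, and the work saved is prefill only.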
How providers make caching pay
A cache discount is only worth offering if cache hits are frequent and long-lived. In GPU memory, KV blocks are evicted as soon as live requests need the space, so providers move them off GPU memory and use disk-backed persistence when they need durable caches (OpenRouter describes this). Caches are typically scoped per organization, so a tenant reuses its own prefixes rather than sharing across tenants. KV tensors were long assumed hard to invert back into text, but recent research (USENIX Security 2025, cited by OpenRouter) challenges that, so paged KV is data worth protecting.
Caching and confidential inference
Tinfoil could build the same thing with vLLM: LMCache offloads KV to CPU DRAM, local disk, or a remote store and reuses it across instances. But the cheap, durable on-disk persistence that makes such a cache dependable sits in tension with the secure-enclave guarantee. In a confidential-computing deployment the GPU runs alongside a CPU confidential VM, so the encrypted VM memory and the CPU-GPU transfers are inside the trust boundary. KV held there stays protected, and it could even be shared with another attested enclave that keeps caches in memory. But memory is volatile and limited; the cheap way to keep a cache durable is to write it to disk. A single live enclave can encrypt its own cache to disk and read it back as long as it keeps the key. The complexity arises once several enclaves serve the same model and must read the same disk-backed cache: without an attested key-coordination scheme, the alternatives are a key handed to the client (which breaks API compatibility) or an escrow service outside the enclave (which adds a great deal of complexity to the trusted code base).
What Tinfoil does today
- Engine-level prefix caching runs inside the enclave, where it speeds up repeated prefixes for as long as their blocks remain resident.
- KV cache stays in enclave memory. Tinfoil does not write it to disk for durable reuse.
- Tinfoil does not steer requests to keep a cache warm, and does not report cache hits or discount cached tokens.

