TL;DR: How to set up private RAG with Verba + Tinfoil in 5 steps:
  1. Clone the Tinfoil fork of Verba from GitHub.
  2. Install dependencies and set up Weaviate (local or cloud).
  3. Configure Tinfoil API keys for both embeddings and chat models.
  4. Import your documents through Verba’s interface.
  5. Start chatting with your private knowledge base!
Your documents and conversations never leave your secure environment or get exposed to third parties.

Introduction

Verba is an open-source Retrieval Augmented Generation (RAG) application that lets you chat with your own documents using AI. The Tinfoil fork extends Verba to work seamlessly with Tinfoil’s private inference API, ensuring your documents and conversations remain completely confidential. This integration brings together Verba for document ingestion and RAG pipeline management, Weaviate as your vector database for semantic search, and Tinfoil’s confidential computing infrastructure for private embeddings and chat completions. The result is a fully private knowledge base where your sensitive documents never leave your controlled environment.

Prerequisites

Before you begin, make sure you have a Tinfoil API key from tinfoil.sh, Git installed to clone the repository, and Docker for running the containerized Weaviate deployment.
You’re billed for all usage of the Tinfoil Inference API; see Tinfoil Inference pricing for current rates.
Security Warning: Never share your API key, take care not to commit it to version control systems, and never bundle it into front-end client code.

Installation and Setup

Step 1: Clone the Tinfoil Verba Fork

The Tinfoil fork includes pre-configured integrations for Tinfoil’s API endpoints:
git clone https://github.com/tinfoilsh/Verba.git
cd Verba
git checkout tinfoil

Step 2: Set Your Tinfoil API Key

export TINFOIL_API_KEY=xxx
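Docker Compose also reads a .env file in the project root, so you can persist the key there instead of exporting it in every shell. This assumes the fork’s docker-compose.yml passes TINFOIL_API_KEY through to the containers:
# .env (keep this file out of version control)
TINFOIL_API_KEY=xxx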

Step 3: Start the Services with Docker Compose

docker-compose up -d
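Once the containers are running, confirm everything is healthy before moving on (the service name below assumes the fork’s docker-compose.yml):
docker-compose ps              # all services should show "Up"
docker-compose logs -f verba   # follow Verba's startup logs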

Choosing Your Models

Verba can work with any of Tinfoil’s supported models. For chat models, you can choose from our high-performance reasoning models, advanced multimodal models, and multilingual dialogue models. For embeddings, we recommend using nomic-embed-text for high-quality text embeddings. See our model catalog for the complete list of available models and their capabilities.
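You can also call the same models directly to sanity-check your key and model choices. The sketch below assumes Tinfoil exposes an OpenAI-compatible endpoint at inference.tinfoil.sh; the chat model ID is illustrative, so substitute one from the model catalog:
# Minimal sketch, assuming an OpenAI-compatible Tinfoil endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.tinfoil.sh/v1",  # assumed endpoint
    api_key=os.environ["TINFOIL_API_KEY"],
)

# Embeddings: the recommended nomic-embed-text model.
emb = client.embeddings.create(model="nomic-embed-text", input="hello world")
print(len(emb.data[0].embedding))  # vector dimensionality

# Chat: replace with any chat model ID from the catalog.
chat = client.chat.completions.create(
    model="llama3-3-70b",  # illustrative model ID
    messages=[{"role": "user", "content": "Say hello."}],
)
print(chat.choices[0].message.content)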

Running Verba

Start the Application

Launch Verba with your configuration:
verba start --port 8000 --host 0.0.0.0
Access Verba at http://localhost:8000
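To confirm the server is responding before opening a browser:
curl -I http://localhost:8000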

Using Verba with Tinfoil

Document Import and Processing

Navigate to the “Import Data” section to upload individual files, import entire directories, or pull content from URLs. Verba supports PDF, DOCX, TXT, MD, and HTML files. For bulk imports, use the command line:
python -m goldenverba.import --directory ./my_documents --chunk_size 512
Verba processes documents through an automated pipeline that extracts text, chunks the content into smaller segments, generates embeddings, and stores the resulting vectors in Weaviate. The chunking step splits long documents into manageable pieces that can be retrieved without overwhelming the context window. You can customize these processing settings in the Verba UI.

Chunk size determines segment size in tokens (roughly equivalent to words). 512 tokens works well for most use cases; use 256 tokens for more precise retrieval of dense technical content, or 1024 tokens to capture larger concepts.

Chunk overlap ensures important information isn’t lost at chunk boundaries. Setting it to around 20% of your chunk size creates a buffer where adjacent chunks share content: with 512-token chunks, a 100-token overlap means each chunk shares its last 100 tokens with the next one.

Verba generates the embeddings using Tinfoil’s nomic-embed-text model within confidential computing enclaves and stores them in Weaviate for semantic search.
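To make the overlap mechanics concrete, here is a minimal sliding-window chunker. It is an illustrative sketch of the idea, not Verba’s actual implementation, and it splits on whitespace tokens rather than model tokens:
# Illustrative sliding-window chunker; Verba's real pipeline counts
# model tokens, while this sketch splits on whitespace for simplicity.
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
    assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
    tokens = text.split()
    step = chunk_size - overlap  # each new chunk starts 412 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the text
    return chunks
With 512-token chunks and a 100-token overlap, consecutive chunks share their boundary tokens, so a sentence that straddles a cut still appears intact in one of them.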

Querying Your Knowledge Base

Ask questions through Verba’s chat interface and it will search your documents for relevant information before generating a response. The RAG pipeline first retrieves the most relevant chunks from your vector database, then sends them to Tinfoil’s language model along with your question. A few settings shape this process.

Retrieval count controls how many chunks are retrieved, typically 3-5. Fewer chunks (3) produce more focused answers, while more chunks (5) provide broader context but may include less relevant information.

Temperature affects response creativity: lower values (0.1-0.3) work best for factual queries that should stick closely to your documents, while higher values allow more creative interpretation.

Max tokens limits response length, useful for keeping answers concise or for allowing detailed explanations. Context window determines how much retrieved content is included in the prompt.

Verba provides source attribution for every response, showing which documents and pages were used. You’ll see confidence scores for each chunk and direct links to the source material for verification.
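The sketch below walks through the same pipeline by hand: embed the question with Tinfoil, retrieve the nearest chunks from Weaviate, and pass them to a chat model. The endpoint, collection name, property name, and model IDs are assumptions for illustration; Verba handles all of this for you.
# Hand-rolled version of the query pipeline, for illustration only.
# Assumes an OpenAI-compatible Tinfoil endpoint, a Weaviate collection
# named "Documents" with a "text" property, and illustrative model IDs.
import os
import weaviate
from openai import OpenAI

tinfoil = OpenAI(
    base_url="https://inference.tinfoil.sh/v1",  # assumed endpoint
    api_key=os.environ["TINFOIL_API_KEY"],
)

question = "What does our security policy say about API keys?"

# 1. Embed the question with the same model used at import time.
vector = tinfoil.embeddings.create(
    model="nomic-embed-text", input=question
).data[0].embedding

# 2. Retrieve the closest chunks from Weaviate (retrieval count = 3).
db = weaviate.connect_to_local()
docs = db.collections.get("Documents")  # assumed collection name
hits = docs.query.near_vector(near_vector=vector, limit=3)
context = "\n\n".join(h.properties["text"] for h in hits.objects)
db.close()

# 3. Ask the chat model, grounding it in the retrieved context.
answer = tinfoil.chat.completions.create(
    model="llama3-3-70b",  # illustrative model ID
    temperature=0.2,       # low temperature for factual answers
    max_tokens=500,
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)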

Scaling Your Deployment

Run multiple Verba instances behind a load balancer to distribute query traffic across servers and maintain fast response times. Weaviate supports clustering to distribute document embeddings across multiple nodes, useful for millions of documents or high query volumes. Separate document ingestion from query workloads by running them on dedicated infrastructure, since processing and embedding documents is compute-intensive.
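As a first step, Docker Compose itself can run multiple replicas of the application container behind your existing load balancer. The service name is an assumption about the fork’s compose file, and this only works if the compose file doesn’t pin the service to a fixed host port:
docker-compose up -d --scale verba=3
Since the document vectors live in Weaviate rather than in the Verba containers, the vector database is the component that needs clustering as your corpus grows.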