Production NDA — described architecturally

Air-Gapped RAG Platform

Production Q&A system for regulated enterprise environments

Architected and deployed a production-grade, deterministic RAG system fully deployable in secure air-gapped enterprise environments — zero external API calls. Docling-powered document ingestion, semantic chunking, vector embedding via Sentence Transformers, and vLLM-served LLM inference behind FastAPI microservices.

Tech stack

vLLMElasticsearchSentence TransformersDoclingFastAPIDockerGitLab CI

Problem

Enterprise clients in regulated industries needed LLM-powered document Q&A on sensitive internal data — but couldn't allow any data to leave their network perimeter.

Architecture

▸**Ingestion:** Docling extracts and chunks documents (PDF, DOCX, HTML)

▸**Embedding:** Sentence Transformers (all-MiniLM-L6-v2) encode chunks → stored in Elasticsearch dense_vector fields

▸**Retrieval:** Semantic + BM25 hybrid search with re-ranking

▸**Generation:** vLLM serves open-source LLMs (Llama 3, Mistral) fully on-premise

▸**API:** FastAPI microservices with streaming response support

▸**Deployment:** Fully containerised via Docker Compose; CI/CD via GitLab

Key Engineering Decisions

Why vLLM over TGI: better continuous batching, PagedAttention for memory efficiency, OpenAI-compatible API surface makes swapping models trivial.

Why Elasticsearch over Chroma/FAISS: enterprise-grade reliability, existing ops tooling, hybrid BM25 + dense retrieval out of the box.

Outcome

Successfully deployed in two regulated enterprise environments. Deterministic retrieval pipeline with audit logs for every query-document pair.

All projects