Back to projects
Production NDA — described architecturally

Air-Gapped RAG Platform

Production Q&A system for regulated enterprise environments

Architected and deployed a production-grade, deterministic RAG system fully deployable in secure air-gapped enterprise environments — zero external API calls. Docling-powered document ingestion, semantic chunking, vector embedding via Sentence Transformers, and vLLM-served LLM inference behind FastAPI microservices.

vLLMElasticsearchSentence TransformersDoclingFastAPIDockerGitLab CI

Problem

Enterprise clients in regulated industries needed LLM-powered document Q&A on sensitive internal data — but couldn't allow any data to leave their network perimeter.

Architecture

  • **Ingestion:** Docling extracts and chunks documents (PDF, DOCX, HTML)
  • **Embedding:** Sentence Transformers (all-MiniLM-L6-v2) encode chunks → stored in Elasticsearch dense_vector fields
  • **Retrieval:** Semantic + BM25 hybrid search with re-ranking
  • **Generation:** vLLM serves open-source LLMs (Llama 3, Mistral) fully on-premise
  • **API:** FastAPI microservices with streaming response support
  • **Deployment:** Fully containerised via Docker Compose; CI/CD via GitLab
  • Key Engineering Decisions

    Why vLLM over TGI: better continuous batching, PagedAttention for memory efficiency, OpenAI-compatible API surface makes swapping models trivial.

    Why Elasticsearch over Chroma/FAISS: enterprise-grade reliability, existing ops tooling, hybrid BM25 + dense retrieval out of the box.

    Outcome

    Successfully deployed in two regulated enterprise environments. Deterministic retrieval pipeline with audit logs for every query-document pair.