Case study · AI & RAG
excelRAG: chat with your PDFs, with nothing leaving the machine
A full retrieval-augmented-generation pipeline — local embeddings, a local language model, sourced answers — packaged as a single Windows executable for people who can't send their documents to the cloud.
01 The problem
Plenty of people have a folder of PDFs they'd love to just ask questions of — contracts, manuals, technical reports — and plenty of them work somewhere that makes cloud AI a non-starter. The documents are confidential, regulated, or simply not allowed off the premises. Every convenient "chat with your PDF" service is exactly the wrong answer, because the price of the convenience is uploading the sensitive thing. The gap is a tool that gives you the same experience without the data ever leaving the room.
02 Constraints
- Truly offline. No API keys, no network calls. It had to work on a machine with no internet connection at all.
- Runnable by a non-technical user. The target user won't set up Python or a model server. It had to be a double-click.
- Real hardware. It needed to run on ordinary office machines — CPU-only if there's no GPU — without a heroic setup.
- Answers grounded in the documents. Not a chatbot riffing; responses had to come from the actual source text.
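The grounding constraint is usually enforced at the prompt level. The sketch below shows one way to do that, assuming retrieved chunks arrive as (label, text) pairs. The function name, the label format and the instruction wording are all illustrative assumptions, not excelRAG's actual prompt.

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt that restricts the model to the retrieved text.

    `chunks` is a list of (source_label, text) pairs. The labels are
    hypothetical, for example "contract.pdf p.3".
    """
    # Number each excerpt so the answer can point back to its source.
    context = "\n\n".join(
        f"[{i}] ({label}) {text}"
        for i, (label, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered excerpts below. "
        "Cite the excerpt numbers you used. If the excerpts do not contain "
        "the answer, say that you cannot find it in the documents.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    demo = build_grounded_prompt(
        "What is the notice period?",
        [("contract.pdf p.3", "Notice period is 30 days.")],
    )
    print(demo)
```

Numbering the excerpts gives the model something concrete to cite, which is what lets the interface show where an answer came from.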
03 Approach
It's a standard RAG pipeline, built entirely from local pieces. You drag PDFs in; PyMuPDF extracts the text, which is chunked and embedded with a Sentence-Transformers model into vectors held in a NumPy array. A question is embedded the same way, the nearest chunks are retrieved by vector similarity, and those chunks plus the question are handed to a quantised GGUF language model (via ctransformers) that writes an answer grounded in the retrieved context. The interface is a Flask app you reach in a browser tab. Models load lazily and run on CPU or GPU, and the whole thing is frozen with PyInstaller into one executable.
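The chunking and retrieval steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the character-window chunker, the window sizes and the choice of cosine similarity are guesses at a typical setup, and the vectors here are tiny stand-ins for what a Sentence-Transformers model's `encode` call would return.

```python
import numpy as np


def chunk_text(text, size=500, overlap=100):
    """Split text into overlapping character windows (sizes are assumptions)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop early enough that the last window is not just leftover overlap.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def top_k(query_vec, chunk_vecs, k=3):
    """Return (indices, scores) of the k chunks closest to the query by cosine."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q                          # one cosine score per chunk
    order = np.argsort(scores)[::-1][:k]    # highest score first
    return order.tolist(), scores[order].tolist()


if __name__ == "__main__":
    # Stand-in 2-D "embeddings"; real ones would have hundreds of dimensions.
    chunk_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    query_vec = np.array([1.0, 0.1])
    indices, scores = top_k(query_vec, chunk_vecs, k=2)
    print(indices, scores)
```

A brute-force matrix product like this is plausible for a folder of PDFs: a few thousand chunks fit comfortably in memory, so no separate vector database is needed.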
04 Architecture
- Ingest: drag-drop PDFs → PyMuPDF text → chunk → embed.
- Local vector store: Sentence-Transformers embeddings in NumPy, nearest-chunk retrieval.
- Local GGUF LLM: ctransformers, answers grounded in retrieved chunks, CPU/GPU.
- Flask UI · one .exe: browser chat, everything frozen by PyInstaller.
No arrow in this diagram crosses the network boundary. That's the feature.
05 One hard decision & the trade-off
Ship a local quantised GGUF model inside a single .exe, instead of calling a hosted model or asking the user to run a separate model server.
A hosted model would have been faster and smarter per token — but it would have destroyed the one property the whole project exists for, so it was never really an option. Bundling the model instead means the user gets true privacy and a double-click install, at the cost of a large executable, slower inference than a frontier API, and answers from a smaller model. I leaned into that trade with lazy loading and CPU/GPU support so the weight is only paid when needed, and accepted the quality ceiling because for this audience "runs entirely on my machine" beats "a bit smarter, in someone else's cloud" every time.
06 Outcome
excelRAG lets a non-technical user on a locked-down machine do the thing everyone wants — ask their documents questions and get sourced answers — without any of it touching the internet. It's the clearest demonstration of the two things I care about together: real AI engineering (embeddings, retrieval, local inference) and real shipping discipline (a whole pipeline that a normal person runs from one file).
Tech
- Python
- Flask
- Sentence-Transformers
- PyMuPDF
- NumPy
- ctransformers + GGUF
- PyInstaller
07 FAQ
Is it really fully offline?
Yes. Both halves of the pipeline run locally: a Sentence-Transformers model produces the embeddings and a quantised GGUF language model generates the answers, all on the user's own hardware. There are no API keys and no network calls, so the documents never leave the machine — which is the entire reason it exists.
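One way to check a claim like this during development is to make any outbound connection fail loudly. The guard below is a hypothetical test aid, not something the case study describes: it patches `socket.socket.connect` so that only loopback addresses (which the local Flask UI needs) are allowed. It does not intercept DNS lookups, so it is a tripwire rather than a sandbox.

```python
import socket

_LOOPBACK = {"127.0.0.1", "::1", "localhost"}
_real_connect = socket.socket.connect


def _guarded_connect(self, address):
    """Refuse TCP/UDP connections to anything other than this machine."""
    # AF_INET / AF_INET6 addresses are tuples whose first item is the host.
    if isinstance(address, tuple) and address[0] not in _LOOPBACK:
        raise RuntimeError(f"blocked outbound connection to {address[0]!r}")
    return _real_connect(self, address)


def block_outbound_network():
    """Install the guard for the rest of the process (assumed test-only use)."""
    socket.socket.connect = _guarded_connect


if __name__ == "__main__":
    block_outbound_network()
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect(("203.0.113.10", 80))  # documentation-only address
    except RuntimeError as exc:
        print(exc)
    finally:
        sock.close()
```

Running the full ingest-and-answer flow with the guard installed would surface any library that quietly tries to download a model or phone home.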
How does a non-technical user run a RAG stack?
It's packaged with PyInstaller into a single Windows executable. The user double-clicks it, drops in PDFs and asks questions in a browser tab served by the bundled Flask app. All the machinery — Python, the embedding model, the local LLM — is inside the one file; there is nothing to pip install.
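A one-file PyInstaller build unpacks its bundled data into a temporary directory at launch and exposes that location as `sys._MEIPASS`, so code that opens model files has to resolve paths against it. The helper below is a common recipe for that, shown as a sketch; the function name and the example file name are hypothetical.

```python
import sys
from pathlib import Path


def resource_path(relative):
    """Resolve a bundled file when frozen by PyInstaller or run from source."""
    # PyInstaller sets sys._MEIPASS in frozen apps; fall back to the
    # current working directory when running as a plain script.
    base = Path(getattr(sys, "_MEIPASS", Path.cwd()))
    return base / relative


if __name__ == "__main__":
    # Hypothetical file name; the real model file is not named in the write-up.
    print(resource_path("models/model.gguf"))
```

The files themselves still have to be declared at build time (for example with PyInstaller's `--add-data` option), otherwise the path resolves but nothing is there.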
Won't a local language model be painfully slow?
It's slower than a hosted frontier model, but usable, and it's engineered for that. The app supports both CPU and GPU inference, and loads the heavy models lazily so startup is quick and memory isn't consumed until you actually ask something. For privacy-sensitive documents, running locally is the point — the latency is a fair trade for never sending the data anywhere.
Need an internal tool that runs without IT approval?
Let's talk. Offline AI, self-contained apps, and tools that respect where your data is allowed to go — that's the work.