Case study · AI & RAG

excelRAG: chat with your PDFs, with nothing leaving the machine

A full retrieval-augmented-generation pipeline — local embeddings, a local language model, sourced answers — packaged as a single Windows executable for people who can't send their documents to the cloud.

Python · Flask · Sentence-Transformers · GGUF
Status: working, packaged as a standalone .exe

01 The problem

Plenty of people have a folder of PDFs they'd love to just ask questions of — contracts, manuals, technical reports — and plenty of them work somewhere that makes cloud AI a non-starter. The documents are confidential, regulated, or simply not allowed off the premises. Every convenient "chat with your PDF" service is exactly the wrong answer, because the price of the convenience is uploading the sensitive thing. The gap is a tool that gives you the same experience without the data ever leaving the room.

02 Constraints

Truly offline. No API keys, no network calls. If the machine had no internet, it still had to work.
Runnable by a non-technical user. The target user won't set up Python or a model server. It had to be a double-click.
Real hardware. It needed to run on ordinary office machines — CPU-only if there's no GPU — without a heroic setup.
Answers grounded in the documents. Not a chatbot riffing; responses had to come from the actual source text.

03 Approach

It's a standard RAG pipeline, built entirely from local pieces. You drag PDFs in; PyMuPDF extracts the text, which is chunked and embedded with a Sentence-Transformers model into vectors held in NumPy arrays. A question is embedded the same way, the nearest chunks are retrieved by vector similarity, and those chunks plus the question are handed to a quantised GGUF language model (via ctransformers) that writes an answer grounded in the retrieved context. The interface is a Flask app you reach in a browser tab. Models load lazily, on first use, and run on CPU or GPU, and the whole thing is frozen with PyInstaller into one executable.
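The ingest half of that pipeline can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function names, the word-based windowing, and the default `size` and `overlap` values are all assumptions, since the write-up does not say how chunks are cut.

```python
from typing import List


def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so chunking works without it installed

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def chunk_text(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words.

    The window sizes are illustrative assumptions, not values from the project.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Overlapping windows are one common way to avoid cutting a sentence's context in half at a chunk boundary; a sentence-aware or page-aware splitter would be an equally reasonable choice.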

04 Architecture

Ingest

Drag-drop PDFs → PyMuPDF text → chunk → embed.

Local vector store

Sentence-Transformers embeddings in NumPy. Nearest-chunk retrieval.

Local GGUF LLM

ctransformers. Answers grounded in retrieved chunks. CPU/GPU.

Flask UI · one .exe

Browser chat, everything frozen by PyInstaller.

No arrow in this diagram crosses the network boundary. That's the feature.
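The two middle boxes, retrieval and grounding, might look roughly like the sketch below. It assumes cosine similarity over a 2-D NumPy array of chunk embeddings (the kind of array `SentenceTransformer.encode` returns) and a hypothetical prompt template; the write-up names neither the similarity measure nor the prompt wording, so both are stand-ins.

```python
import numpy as np


def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3) -> list:
    """Return the indices of the k chunks most similar to the query (cosine)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q  # one cosine score per chunk
    order = np.argsort(-scores)[:k]  # highest score first
    return [int(i) for i in order]


def build_prompt(question: str, chunks: list, sources: list) -> str:
    """Assemble a prompt that restricts the model to the retrieved text.

    The instruction wording is an illustrative assumption.
    """
    context = "\n\n".join(
        f"[{i}] ({src}) {text}"
        for i, (src, text) in enumerate(zip(sources, chunks), start=1)
    )
    return (
        "Answer using only the context below and cite sources by number. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering each chunk and carrying its source label into the prompt is what would let the answer point back to a specific document, which is the "sourced answers" property described above.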

05 One hard decision & the trade-off

The decision

Ship a local quantised GGUF model inside a single .exe, instead of calling a hosted model or asking the user to run a separate model server.

A hosted model would have been faster and smarter per token — but it would have destroyed the one property the whole project exists for, so it was never really an option. Bundling the model instead means the user gets true privacy and a double-click install, at the cost of a large executable, slower inference than a frontier API, and answers from a smaller model. I leaned into that trade with lazy loading and CPU/GPU support so the weight is only paid when needed, and accepted the quality ceiling because for this audience "runs entirely on my machine" beats "a bit smarter, in someone else's cloud" every time.
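As a rough sketch of the lazy-loading idea, the wrapper below defers an expensive load until the first request and then reuses the result. The `LazyModel` class and `load_llm` helper are hypothetical names, and `model_type="llama"` is an assumption about the model family; only `AutoModelForCausalLM.from_pretrained` and its `gpu_layers` argument come from ctransformers itself.

```python
import threading
from typing import Any, Callable, Optional


class LazyModel:
    """Defer an expensive load until first use, then reuse the result."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        if self._model is None:
            with self._lock:  # avoid loading twice under concurrent requests
                if self._model is None:
                    self._model = self._loader()
        return self._model


def load_llm(model_path: str, gpu_layers: int = 0) -> Any:
    """Load a GGUF model with ctransformers; gpu_layers=0 keeps it on the CPU."""
    from ctransformers import AutoModelForCausalLM  # heavy import, deferred

    return AutoModelForCausalLM.from_pretrained(
        model_path,
        model_type="llama",  # assumption: depends on the bundled model
        gpu_layers=gpu_layers,
    )
```

A Flask route would then call something like `LazyModel(lambda: load_llm(path)).get()` on the first question, so the app window can open before any model weights are read from disk.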

06 Outcome

100% offline · no API keys

1 .exe, double-click to run

0 bytes of data leave the machine

excelRAG lets a non-technical user on a locked-down machine do the thing everyone wants — ask their documents questions and get sourced answers — without any of it touching the internet. It's the clearest demonstration of the two things I care about together: real AI engineering (embeddings, retrieval, local inference) and real shipping discipline (a whole pipeline that a normal person runs from one file).

Tech

Python
Flask
Sentence-Transformers
PyMuPDF
NumPy
ctransformers + GGUF
PyInstaller

07 FAQ

Is it really fully offline?

Yes. Both halves of the pipeline run locally: a Sentence-Transformers model produces the embeddings and a quantised GGUF language model generates the answers, all on the user's own hardware. There are no API keys and no network calls, so the documents never leave the machine — which is the entire reason it exists.
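One way to make "no network calls" explicit in code, offered as an assumption rather than a description of what excelRAG does, is to set the Hugging Face offline flags before any model library is imported. `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are real environment variables read by `huggingface_hub` and `transformers`; the helper name is hypothetical.

```python
import os

# Flags honoured by huggingface_hub and transformers; with these set, the
# libraries use only files already on disk instead of contacting the Hub.
OFFLINE_FLAGS = {"HF_HUB_OFFLINE": "1", "TRANSFORMERS_OFFLINE": "1"}


def enforce_offline(env=os.environ):
    """Set the offline flags; call this before importing any model library."""
    for name, value in OFFLINE_FLAGS.items():
        env[name] = value
    return env
```

With the flags set, a model that is missing from the bundle fails loudly at load time instead of quietly attempting a download.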

How does a non-technical user run a RAG stack?

It's packaged with PyInstaller into a single Windows executable. The user double-clicks it, drops in PDFs and asks questions in a browser tab served by the bundled Flask app. All the machinery — Python, the embedding model, the local LLM — is inside the one file; there is nothing to pip install.
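A detail that single-file PyInstaller builds usually need is a way to locate bundled files at runtime, because the executable unpacks itself into a temporary folder and exposes that folder as `sys._MEIPASS`. The helper below is a common pattern for this, shown as a sketch; the function name and the example path in the comment are hypothetical.

```python
import os
import sys


def resource_path(relative: str) -> str:
    """Resolve a bundled file both when run from source and when frozen.

    In a one-file PyInstaller build, sys._MEIPASS points at the temporary
    folder the executable unpacked into; from source it does not exist,
    so the current working directory is used instead.
    """
    base = getattr(sys, "_MEIPASS", os.path.abspath("."))
    return os.path.join(base, relative)


# Example (hypothetical layout): resource_path("models/llm.gguf")
```

Routing every model and template path through one helper like this keeps the same code working in development and inside the frozen .exe.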

Won't a local language model be painfully slow?

It's slower than a hosted frontier model, but usable, and it's engineered for that. The app supports both CPU and GPU inference, and loads the heavy models lazily so startup is quick and memory isn't consumed until you actually ask something. For privacy-sensitive documents, running locally is the point — the latency is a fair trade for never sending the data anywhere.

Need an internal tool that runs without IT approval?

Let's talk. Offline AI, self-contained apps, and tools that respect where your data is allowed to go — that's the work.

