We recently finished a project that started as a simple ask (make a few thousand scanned documents searchable) and turned into a case study in how dramatically the OCR landscape has shifted in the past twelve months. We went from a fragile pipeline stitched together from three different OCR engines to a single 0.9-billion-parameter vision-language model running on a Mac. The new system is simpler, more accurate, and every byte of data stays on premises. Here's what happened.
The Problem
Our client had roughly 3,200 project folders on a QNAP NAS containing years of scanned contracts, compliance forms, and business records across six internal brands. The documents were a mix of clean typed PDFs and scanned pages with handwritten fields: names, dates, signatures, annotations. Some had been scanned on whatever was available at the time, resulting in skewed pages, inconsistent quality, and mixed content on the same sheet.
The requirement was straightforward: make all of it searchable from a web interface so staff could find specific documents without manually opening folders. The constraint was equally straightforward: the data could not leave the office network. No cloud APIs, no uploading pages to third-party services. Everything had to run on hardware they already owned.
What We Tried First (And Why It Failed)
TrOCR
We started with Microsoft's TrOCR, a transformer-based OCR model with strong benchmark numbers, available through Hugging Face. It produced excellent results on single lines of clean typed text. The problem: TrOCR is a line-level model. Processing a full document page requires segmenting it into individual text lines first, and complex layouts (forms with tables, multi-column pages, mixed text sizes) led to incorrect segmentation that cascaded into garbage output. Handwritten content in the middle of a typed form was particularly bad. We pulled it from the pipeline.
EasyOCR
EasyOCR is an open-source Python library with GPU acceleration and support for 80+ languages. For mostly-typed scanned pages it worked fine and became our fallback. But on the documents that mattered most (those with mixed typed and handwritten content), confidence scores on handwritten fields dropped below any usable threshold. Signatures and cursive were essentially unreadable.
PaddleOCR
PaddleOCR 3.4.0 from Baidu was our most promising candidate. It includes a dedicated handwriting recognition model and has shown strong academic benchmark results on mixed-content documents. In practice, we hit persistent model download failures, version compatibility issues between the detection and recognition models, and runtime crashes that required manual workarounds. When it worked, the results were good. Getting it to work reliably in production was the problem.
The State of the Pipeline Before the Rewrite
At this point we had a 765-line Python script that routed pages through three different OCR engines depending on content type, with confidence thresholds, fallback logic, and special-case handling for specific document categories. It was fragile, slow to run, and the handwriting recognition (the thing we needed most) was the weakest link. The indexing pipeline touched CUDA, multiple model downloads, and two separate ML frameworks. Maintaining it was a tax on every other part of the project.
The Rewrite: One Model, One API Call
In early 2026, a new class of lightweight vision-language models purpose-built for OCR started appearing. These aren't general-purpose chatbots that happen to read images; they're specialist models trained specifically on document understanding, but built on the same transformer architecture that powers large language models. The key difference from traditional OCR: they don't just recognize characters. They use language context to interpret what they're seeing, the same way a human can read bad handwriting by understanding what word makes sense in context.
We evaluated three of these models: GLM-OCR from Zhipu AI, PaddleOCR-VL from Baidu, and Qwen2.5-VL from Alibaba. All three are open-weight and run locally. GLM-OCR won on the criteria that mattered to us: best handwriting performance in independent user testing, the smallest footprint at 0.9 billion parameters, native Ollama support for dead-simple deployment, and structured JSON extraction via prompt schema.
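The structured extraction mentioned above can be sketched as an Ollama request payload. Ollama's /api/generate endpoint accepts a "format" field that can carry a JSON schema to constrain the model's output; the field names and prompt below are our illustrative assumptions, not a schema GLM-OCR requires.

```python
# Sketch of structured JSON extraction via a prompt schema. The schema
# fields (signer_name, signature_date, ...) are hypothetical examples.
import json

FIELD_SCHEMA = {
    "type": "object",
    "properties": {
        "signer_name": {"type": "string"},
        "signature_date": {"type": "string"},
        "document_type": {"type": "string"},
    },
    "required": ["signer_name", "signature_date"],
}

def build_extraction_request(image_b64: str) -> dict:
    """Payload asking the model for JSON that matches FIELD_SCHEMA."""
    return {
        "model": "glm-ocr",
        "prompt": "Extract the signer's name and the signature date "
                  "from this form as JSON.",
        "images": [image_b64],
        "format": FIELD_SCHEMA,  # constrains output to the schema
        "stream": False,
    }
```

The response's "response" field then parses directly with json.loads, which is what makes schema-constrained extraction more robust than regexing free-form model output.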
Deployment
The entire OCR inference engine is GLM-OCR running on Ollama on an Apple Mac Studio M1 Ultra with 64 GB of unified memory. The model uses roughly 2 GB. Installation was one command: ollama pull glm-ocr.
A Python script on a separate Ubuntu workstation orchestrates the pipeline: it walks the NAS mount, renders each PDF page to a 300 DPI image using PyMuPDF, sends the image over HTTP to the Mac's Ollama endpoint, and writes the extracted text into PostgreSQL. The script is about 400 lines, handles resume-after-interruption via database state, and processes pages sequentially at roughly 3 pages per minute.
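The per-page loop described above can be sketched in a few functions. The Ollama hostname, model name, and prompt text are assumptions for illustration; the PostgreSQL write and error handling are omitted.

```python
# Minimal sketch of the render-then-OCR step: one PDF page in,
# extracted text out, via an HTTP call to Ollama.
import base64
import json
import urllib.request

OLLAMA_URL = "http://mac-studio.local:11434/api/generate"  # hypothetical host

def render_page(pdf_path: str, page_index: int) -> bytes:
    """Render one PDF page to a 300 DPI PNG with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the rest is stdlib-only
    with fitz.open(pdf_path) as doc:
        return doc[page_index].get_pixmap(dpi=300).tobytes("png")

def build_ocr_request(png_bytes: bytes) -> dict:
    """Payload for Ollama's /api/generate endpoint."""
    return {
        "model": "glm-ocr",
        "prompt": "Extract all text from this document page.",
        "images": [base64.b64encode(png_bytes).decode("ascii")],
        "stream": False,
    }

def ocr_page(png_bytes: bytes) -> str:
    """POST the page image to the Mac's Ollama endpoint; return the text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_ocr_request(png_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

Because each page is one stateless HTTP call, the orchestration script stays CPU-only and the failure unit is exactly one page.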
That's the whole system. No CUDA dependencies on the processing server. No model weight downloads that break between versions. No routing logic between multiple engines. One model, one HTTP call per page, deterministic behavior.
What Changed in Practice
The 765-line multi-engine pipeline became a 400-line single-model script. Three OCR frameworks with separate dependencies became one Ollama model. GPU coordination on the processing server became unnecessary; the Python script is CPU-only, just rendering images and making HTTP calls. The Mac handles inference.
More importantly, the results are better. Handwritten names, dates, and annotations on scanned forms come through clearly. The model reads a signature field and returns the name. It reads a date scrawled in the margin and returns a date. It handles skewed scans, mixed typed-and-handwritten content, and degraded copies without special-case logic because the vision-language architecture treats the whole page as context.
Search Architecture
Extracted text goes into PostgreSQL 16 using its built-in full-text search. Each page gets a tsvector column that's automatically generated from the OCR content, indexed with a GIN index for fast lookup.
Search uses websearch_to_tsquery for web-search-style query parsing with support for phrases, boolean operators, and stemming. Results are ranked by ts_rank_cd (cover density ranking) and returned with highlighted snippet excerpts via ts_headline. A query across the entire document corpus returns in under two seconds.
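The schema and query shapes described above look roughly like the SQL below, shown as string constants. Table and column names (pages, ocr_text, and so on) are our illustrative choices, not the client's actual schema.

```python
# Sketch of the PostgreSQL full-text search layer: a generated tsvector
# column, a GIN index, and a ranked, highlighted search query.

SCHEMA_SQL = """
CREATE TABLE pages (
    id        bigserial PRIMARY KEY,
    doc_path  text NOT NULL,
    page_no   int  NOT NULL,
    brand     text,
    ocr_text  text,
    -- tsvector maintained automatically as a stored generated column
    tsv       tsvector GENERATED ALWAYS AS (
                  to_tsvector('english', coalesce(ocr_text, ''))
              ) STORED
);
CREATE INDEX pages_tsv_idx ON pages USING GIN (tsv);
"""

SEARCH_SQL = """
SELECT doc_path,
       page_no,
       ts_rank_cd(tsv, q) AS rank,
       ts_headline('english', ocr_text, q) AS snippet
FROM   pages,
       websearch_to_tsquery('english', %(query)s) AS q
WHERE  tsv @@ q
ORDER  BY rank DESC
LIMIT  50;
"""
```

The generated column means no trigger or application code keeps the index in sync: inserting OCR text is enough, and the GIN index serves the @@ match.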
The frontend is a React 19 single-page application backed by an ASP.NET Core 9 Web API. Users search by keyword, filter by brand, and see ranked results with highlighted matches and page numbers. An admin panel triggers and monitors indexing jobs with real-time progress.
PostgreSQL's full-text search handles the core use case of finding documents by what's in them, without requiring a separate search engine like Elasticsearch. For an archive of this size, it's the right tool. The GIN indexes are efficient, the ranking is good, and there's no additional infrastructure to maintain.
The Data Sovereignty Angle
This project could have been done with cloud OCR APIs. Google Cloud Vision, AWS Textract, and Azure Cognitive Services all offer document OCR endpoints that would have produced comparable results with less engineering effort. We didn't use any of them, and the reason was simple: the documents contain sensitive information that cannot leave the client's network.
Running the entire pipeline on premises (NAS storage, OCR inference on a Mac, database on a Linux workstation, web interface served internally) means no document page is ever transmitted to a third party. There are no API keys, no per-page billing, no data processing agreements to negotiate, and no risk surface beyond the office LAN.
The economics work out too. Cloud OCR APIs typically cost $1-15 per thousand pages. Self-hosted inference on hardware the client already owned costs electricity. For an archive that needs to be re-indexed when models improve or search requirements change, the difference compounds quickly.
The key enabler is model size. At 0.9 billion parameters, GLM-OCR runs comfortably on consumer hardware. The M1 Ultra we used is a capable machine, but it's a standard Mac Studio, not a data center GPU. The model would run on an M1 MacBook if needed, just slower. This class of efficient, specialized VLM is what makes on-premises document intelligence practical for small and mid-size organizations that couldn't justify an A100 cluster.
What the Hardware Actually Looks Like
No rack of GPUs. No cloud instances. Three machines on a 10-gigabit office network:
Processing server (Ubuntu 24.04 workstation, Intel i9-9900K, 128 GB RAM): Runs the Python orchestration script, hosts PostgreSQL with 7 TB of available storage, and serves the .NET API and React frontend. Its GPUs are dedicated to other workloads; the OCR pipeline doesn't use them.
Inference server (Apple Mac Studio, M1 Ultra, 64 GB unified memory): Runs Ollama with GLM-OCR for document OCR and a separate 32B chat model for natural language queries against the project database. The OCR model uses roughly 2 GB of the 64 GB available.
Storage (QNAP NAS, 10G bonded): Holds the source documents, mounted read-only on the processing server via SMB.
The total hardware cost for the OCR-specific capability was zero; the Mac and the workstation were already in place for other projects. The only new software was Ollama (free) and GLM-OCR (open-weight, MIT license).
Lessons Learned
Specialized small models beat general large models for OCR. GLM-OCR at 0.9B parameters outperforms general-purpose VLMs many times its size on document extraction benchmarks. We evaluated Qwen2.5-VL (a much larger general model) and it would have burned significantly more compute for comparable or worse results on structured document extraction. If your task is well-defined, a specialist model is the right choice.
Simplicity is a feature. The old pipeline had three OCR engines, confidence-based routing, and fallback logic. The new pipeline has one model and one code path. When something goes wrong (and it does: pages time out, PDFs are corrupt, scans are unreadable) the failure mode is obvious and the fix is rerunning the page. There's no interaction between models to debug.
Ollama changed the deployment story. Being able to install an OCR model with ollama pull glm-ocr and serve it over HTTP with zero configuration is what made this project economical. Two years ago, deploying a vision model on premises meant wrestling with PyTorch versions, CUDA drivers, model weight conversions, and GPU memory management. Ollama abstracts all of that.
Resume-after-failure is not optional at scale. Our corpus processes in roughly 80 hours. The script uses the database as its checkpoint; each page is a row with a status column. If the process is interrupted, restarting picks up exactly where it left off. This was the single most important design decision after choosing the model.
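The checkpoint pattern above (each page is a row with a status column, restart resumes from pending work) can be sketched with sqlite3 so it runs standalone; the real pipeline uses PostgreSQL, and the table and column names here are illustrative.

```python
# Minimal sketch of database-as-checkpoint: pages start 'pending',
# are marked 'done' after OCR, and a restart only sees unfinished rows.
import sqlite3

def init_db(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            doc_path TEXT,
            page_no  INTEGER,
            status   TEXT DEFAULT 'pending',   -- pending | done | failed
            text     TEXT,
            PRIMARY KEY (doc_path, page_no)
        )""")

def pending_pages(conn):
    """Pages still to process; a restart picks up exactly here."""
    return conn.execute(
        "SELECT doc_path, page_no FROM pages WHERE status != 'done'"
    ).fetchall()

def mark_done(conn, doc_path, page_no, text):
    conn.execute(
        "UPDATE pages SET status = 'done', text = ? "
        "WHERE doc_path = ? AND page_no = ?",
        (text, doc_path, page_no))
    conn.commit()  # commit per page so progress survives a crash
```

Committing once per page trades a little throughput for the guarantee that at most one page of work is ever lost to an interruption.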
PostgreSQL full-text search is underrated. For a document corpus under a million pages, tsvector with GIN indexes handles search with no additional infrastructure. The ranking is good, the highlighting works, phrase queries work, and it's one less service to deploy and maintain. We'll add vector embeddings for semantic search later, but keyword search with ranking covers the vast majority of real queries today.
Working With Us
Agave Information Solutions builds AI-powered document search and data engineering systems for organizations with sensitive archives that need to stay on premises. The architecture described here (local OCR inference, PostgreSQL search, on-premises LLM hosting) runs on hardware you may already own. No cloud dependencies, no per-document costs, full data sovereignty.
If you're sitting on years of scanned records that need to be searchable, or you need to extract structured data from forms at scale without sending them to a third party, we'd like to talk.
Contact us at agaveis.com
Agave Information Solutions is a Scottsdale, Arizona-based technology firm specializing in AI infrastructure, data engineering, and custom software development. Founded in 2007.