Notes

doctext

28 July 2025

doctext is an toolkit for converting PDF files to markdown. It does the job of an OCR tool but uses vision language models to extract the data.

A 3 billion parameter model, Nanonets-OCR-s, is provided but it also works with other multi-modal language models.