
doc2md: 81% Fewer Tokens When Processing Documents with AI Agents

6 min read · Lucas Mattos

There is a cost most people ignore until they see the bill. When you send a PDF directly to an agent, it does not "read" the file — it consumes the entire binary structure: layout, fonts, metadata, formatting layers. The content you wanted is in there, but you paid for everything around it. doc2md solves this before the agent even opens its mouth.

The library underneath: Microsoft's markitdown

Before anything else: I did not invent the conversion to Markdown. The heavy lifting is done by markitdown — an open source library from Microsoft that runs locally via terminal, no API, no cloud.

What I did was wrap that library in an AI agent skill. The difference is context: instead of passing parameters manually every time, I standardized the process for my workflow — .raw-docs/ as input, docs/ as output, cache to avoid reprocessing what was already converted, direct integration with graphify for code projects.

The credit for the conversion belongs to Microsoft. The skill is the convenience layer on top of it.
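As a rough illustration of that convenience layer, here is a minimal sketch of a wrapper around markitdown's Python API (`MarkItDown().convert(...)` and the `text_content` attribute of its result). The function names `markitdown_convert` and `convert_file` are assumptions made for this example, not code from the doc2md repository.

```python
from pathlib import Path
from typing import Callable


def markitdown_convert(src: Path) -> str:
    """Convert one file to Markdown text with markitdown's Python API."""
    # Imported lazily so this module still loads when markitdown is missing
    # (install it with `pip install markitdown`).
    from markitdown import MarkItDown

    return MarkItDown().convert(str(src)).text_content


def convert_file(
    src: Path,
    dst: Path,
    converter: Callable[[Path], str] = markitdown_convert,
) -> Path:
    """Write the Markdown for `src` to `dst` and return `dst`.

    `converter` is injectable so the wrapper can be exercised without
    markitdown installed; this helper is hypothetical, not part of doc2md.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(converter(src), encoding="utf-8")
    return dst
```

With markitdown installed, a call such as `convert_file(Path(".raw-docs/report.pdf"), Path("docs/report.md"))` would mirror the input and output folders described above (`report.pdf` is a placeholder file name).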

Why Markdown became the lingua franca of agents

Markdown is clean text with minimal structure. LLMs read it better, process it more efficiently, and make fewer interpretation errors. It is no coincidence that Spec-Driven Development, graphify, and most modern agent pipelines work with .md as the standard format.

The problem: the world does not live in Markdown. It lives in PDF, DOCX, PPTX, XLSX, emails, YouTube videos. Converting manually is slow. Letting the agent convert is expensive. You need a step between the two worlds, one that costs zero tokens.

What doc2md is

An AI agent skill. You drop files in .raw-docs/, run /doc2md, and the .md files appear in docs/ ready for any agent to consume.

Supported formats: documents like PDF, DOCX, PPTX, XLSX, and CSV; web content like YouTube (full transcript), Wikipedia, and any HTML; and also EPUB, Jupyter Notebook, Outlook email, and ZIP.

Built-in MD5 cache: files that were already converted are not reprocessed. Watch mode checks for new files every 3 seconds and converts them automatically. Stack: markitdown (Microsoft, open source) plus the Python standard library. No API, no cloud, no key.
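A minimal sketch of how a content-hash cache and a polling watch mode could work with the standard library alone. This is not doc2md's actual code: the JSON cache file, its layout, and the names `convert_pending` and `watch` are assumptions for illustration. The converter is passed in as a function, so anything that maps a path to Markdown text will do.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional

Converter = Callable[[Path], str]


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def convert_pending(raw_dir: Path, out_dir: Path, cache_file: Path,
                    converter: Converter) -> List[Path]:
    """Convert files whose MD5 differs from the cached one; return outputs written."""
    cache: Dict[str, str] = {}
    if cache_file.exists():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    out_dir.mkdir(parents=True, exist_ok=True)

    written: List[Path] = []
    for src in sorted(p for p in raw_dir.iterdir() if p.is_file()):
        digest = file_md5(src)
        if cache.get(src.name) == digest:
            continue  # same content as last run: skip the conversion
        # Note: files sharing a stem (a.pdf, a.docx) would overwrite each other here.
        dst = out_dir / (src.stem + ".md")
        dst.write_text(converter(src), encoding="utf-8")
        cache[src.name] = digest
        written.append(dst)

    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return written


def watch(raw_dir: Path, out_dir: Path, cache_file: Path, converter: Converter,
          interval: float = 3.0, max_cycles: Optional[int] = None) -> None:
    """Poll raw_dir every `interval` seconds; max_cycles=None runs until interrupted."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        convert_pending(raw_dir, out_dir, cache_file, converter)
        cycles += 1
        time.sleep(interval)
```

Hashing the content rather than comparing timestamps means a file that is copied again without changes is still skipped, while an edited file with the same name is converted again.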

The numbers that matter

Example: a 50-page report, ~12,000 words. Sent directly to the agent as a PDF: ~90,000 tokens, or ~$0.27 per document on Claude Sonnet (which implies about $3 per million input tokens). Pre-processed with doc2md, the same content arrives as ~9,000 tokens, or ~$0.027. With these example figures that is about 90% fewer tokens and ~$0.24 saved per document.

As volume grows: 10 docs/month → ~$2.40 saved. 100 docs/month → ~$24.00 saved. 500 docs/month → ~$120.00 saved — with no additional effort.
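The arithmetic behind these figures is easy to reproduce. The sketch below assumes a flat price of $3 per million input tokens, which is what ~90,000 tokens for ~$0.27 implies; the function names are illustrative, and real bills also depend on output tokens and the model used.

```python
# Assumed input price in USD per million tokens (implied by 90,000 tokens ≈ $0.27).
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost(tokens: int,
               price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Cost in USD of sending `tokens` input tokens."""
    return tokens / 1_000_000 * price_per_million


def token_reduction(tokens_before: int, tokens_after: int) -> float:
    """Fraction of tokens removed by pre-processing (0.0 to 1.0)."""
    return 1 - tokens_after / tokens_before


def savings_per_document(tokens_before: int, tokens_after: int) -> float:
    """USD saved on one document by sending the smaller version."""
    return input_cost(tokens_before) - input_cost(tokens_after)


def monthly_savings(docs_per_month: int, tokens_before: int,
                    tokens_after: int) -> float:
    """USD saved per month at a given document volume."""
    return docs_per_month * savings_per_document(tokens_before, tokens_after)


if __name__ == "__main__":
    before, after = 90_000, 9_000  # token counts from the example above
    print(f"reduction: {token_reduction(before, after):.0%}")
    for volume in (10, 100, 500):
        print(f"{volume} docs/month: ${monthly_savings(volume, before, after):.2f}")
```

Swapping in your own token counts and your model's current price gives the break-even picture for your volume.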

And tokens are only the visible part. Agents that receive well-structured Markdown make fewer mistakes, request less reprocessing, and produce more accurate outputs on the first try. That also has a cost — it just does not show up directly on the bill.

How it fits in the pipeline

doc2md goes at the beginning of any flow that needs to consume external documents.

For YouTube specifically — talks, tutorials, recorded meetings — the gain is direct: instead of sending the link to the agent to fetch and process, doc2md extracts the full transcript with metadata locally, before any LLM call.

PDF / DOCX / PPTX / YouTube
           ↓
       /doc2md          ← zero tokens, local, instant
           ↓
       clean .md
           ↓
  Claude / graphify     ← receives structured text, not binary
           ↓
  More precise output, lower cost

The reflection that remains

You find an open source library. It does 80% of what you need, but in its own way, with its parameters, in the order it decided. Before, you were stuck with that. You used it as-is or you did not use it at all.

Now you take the library, wrap it in a skill, and talk to it in natural language to adapt it to your context. Change the folder structure, adjust the flow, integrate with other tools. Without rewriting anything from scratch, without depending on anyone's roadmap.

doc2md is a small example of this, but the principle applies to entire tools. AI is not just automating tasks. It is eliminating the distance between "what exists" and "what I would need to exist." That will become more and more common. And whoever understands this first will stop waiting for someone to build the right tool, and will simply adjust the one that already exists.

What I learned building this

Infrastructure tools have a different ROI than product tools. You do not see the result immediately — but every workflow that passes through it gets a little cheaper, more precise, faster. The effect accumulates silently.

The repository is public. The skill can be copied, modified, adapted to your context — that is the point.

→ github.com/luckmattos/doc2md