document_loader_html_to_markdown
Document Loader HTML to Markdown adds a Document Loader plugin that converts HTML into clean Markdown. If you use the Document Loader module to load documents (e.g. for AI, search, RAG pipelines, or migration), this plugin lets you:
- Convert HTML (headings, paragraphs, bold, italic, links, images, lists, code blocks) to standard Markdown
- Convert tables: HTML <table> elements are converted to Markdown pipe tables by default
- Configure conversion options per-instance (header style, list style, tag stripping, node removal, autolinks, and more)
- Chain with Document Loader HTML Processor: use its cleaned HTML output as input for this plugin
You pass in HTML via DocumentLoaderInputInterface; you get back MarkdownLoaderOutput with getContent() and getMetadata(). The plugin is aimed at developers and site builders who already use or plan to use Document Loader in their pipeline.
Features
What unique features does enabling this project add?
- Table conversion: HTML tables are converted to Markdown pipe tables out of the box (TableConverter is registered by default).
- Configurable converter options: All non-deprecated league/html-to-markdown options can be set as service-level defaults or per-instance via converter_options:
  - header_style: atx (# headings) or setext (underline headings). Default: atx.
  - strip_tags: Remove HTML tags without Markdown equivalents, keeping their text. Default: true.
  - remove_nodes: Space-separated list of tags to remove entirely (content included). Default: script style head.
  - strip_placeholder_links: Remove <a> tags without href. Default: true.
  - hard_break: Convert <br> to \n instead of two trailing spaces + \n.
  - suppress_errors: Suppress libxml warnings when loading malformed HTML.
  - list_item_style: List bullet character: -, *, or +.
  - preserve_comments: Keep HTML comments in Markdown output.
  - use_autolinks: Use simple <url> autolink syntax when possible.
  - table_pipe_escape: Replacement string for pipe characters inside table cells.
  - table_caption_side: Show <caption> before (top) or after (bottom) the table; empty to suppress.
- Two-tier configuration: Service-level defaults (in services.yml) are merged with per-instance converter_options passed to createInstance(), so you can set project-wide defaults and override per call.
- Drush test command: drush dlhtm:test for quickly testing conversions from the command line (inline HTML or local file, all converter options as flags, output to stdout or file).
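The two-tier configuration described above can be pictured as a plain array merge. The following is a minimal sketch, not the module's code: it assumes a shallow merge in which per-instance values win, and the function name merge_converter_options() is hypothetical. The module's actual merge logic may differ.

```php
<?php

/**
 * Hypothetical helper (not part of the module): illustrates a shallow
 * "service defaults, then per-instance overrides" merge.
 */
function merge_converter_options(array $service_defaults, array $instance_options): array {
  // With string keys, array_merge() lets values from later arrays
  // replace values from earlier ones.
  return array_merge($service_defaults, $instance_options);
}

// Defaults as listed on this page for services.yml.
$service_defaults = [
  'header_style' => 'atx',
  'strip_tags' => true,
  'remove_nodes' => 'script style head',
  'strip_placeholder_links' => true,
];

// Per-instance overrides, as they would be passed under 'converter_options'.
$instance_options = [
  'header_style' => 'setext',
  'list_item_style' => '*',
];

$effective = merge_converter_options($service_defaults, $instance_options);
print_r($effective);
```

Under these assumptions, header_style ends up as setext, list_item_style is added, and the remaining defaults are kept unchanged.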
When and why would someone use this module?
- You use Document Loader and need to convert HTML to Markdown for AI/LLM pipelines, vector stores, or search indexing.
- You want clean, token-efficient Markdown from HTML sources (web pages, CMS content, email).
- You are building RAG or retrieval pipelines and need a reusable "HTML in → Markdown out" step.
- You need to chain with the HTML Processor plugin (or another plugin): clean the HTML first, then convert it to Markdown.
Post-Installation
How does this module actually work once I install it?
- Use in code: Get the plugin from the Document Loader manager and call load() with an HtmlInput. The result is a MarkdownLoaderOutput with getContent() and getMetadata().
- Pass configuration per instance: Override converter defaults when creating the plugin by passing converter_options to createInstance():

      // Per-instance overrides; options not listed keep the service-level defaults.
      $loader = $plugin_manager->createInstance('document_loader_html_to_markdown.html_to_markdown', [
        'converter_options' => [
          'header_style' => 'setext',
          'list_item_style' => '*',
        ],
      ]);

- Drush command: Test conversion from the command line:

      # Inline HTML, Markdown printed to stdout.
      drush dlhtm:test --content="<h1>Title</h1><p>Hello <strong>world</strong></p>"
      # Read a local file and write the result to a file.
      drush dlhtm:test --file=/tmp/page.html --out=/tmp/page.md
      # Tables are converted to Markdown pipe tables.
      drush dlhtm:test --content="<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"

There is no dedicated config form or new content type. Configuration is passed per load via converter_options or Drush flags.
Additional Requirements
- Document Loader — document_loader. Required; this module is a plugin for it.
- Composer dependencies (installed with the module): league/html-to-markdown.
No external APIs or manual library downloads are required beyond Composer.
Similar projects
- Generic Markdown converters — This module is specifically a Document Loader plugin. It does not replace standalone Markdown libraries; it fits into the Document Loader ecosystem (loaders, input/output types, plugin manager).
- Document Loader — Provides the framework and other loaders (e.g. PDF, text). This project adds one loader that focuses on HTML → Markdown conversion.
- Document Loader HTML Processor — Focuses on HTML → clean HTML. This project focuses on HTML → Markdown. They complement each other.
Community Documentation
- README and CONTRIBUTING — In the project repository; includes usage examples, configuration options, and DDEV setup instructions.
- Issue queue — document_loader_html_to_markdown issues for bugs and feature requests.
Additional information
- Conversion quality: Powered by league/html-to-markdown, which handles headings, paragraphs, bold, italic, links, images, lists, code blocks, horizontal rules, and blockquotes. Table support is enabled by default via TableConverter.
- Service defaults: The module ships with sensible defaults (header_style: atx, strip_tags: true, remove_nodes: 'script style head', strip_placeholder_links: true) configured in services.yml. Override them per-instance, or change the service definition for project-wide defaults.