Document Loader HTML to Markdown adds a Document Loader plugin that converts HTML into clean Markdown. If you use the Document Loader module to load documents (e.g. for AI, search, RAG pipelines, or migration), this plugin lets you:

  • Convert HTML (headings, paragraphs, bold, italic, links, images, lists, code blocks) to standard Markdown
  • Convert tables — HTML <table> elements are converted to Markdown pipe tables by default
  • Configure conversion options per-instance (header style, list style, tag stripping, node removal, autolinks, and more)
  • Chain with Document Loader HTML Processor — use its cleaned HTML output as input for this plugin

You pass in HTML via DocumentLoaderInputInterface; you get back MarkdownLoaderOutput with getContent() and getMetadata(). The plugin is aimed at developers and site builders who already use or plan to use Document Loader in their pipeline.

Features

What unique features does enabling this project add?

  • Table conversion — HTML tables are converted to Markdown pipe tables out of the box (TableConverter is registered by default).
  • Configurable converter options — All non-deprecated league/html-to-markdown options can be set as service-level defaults or per-instance via converter_options:
    • header_style — atx (# headings) or setext (underline headings). Default: atx.
    • strip_tags — Remove HTML tags without Markdown equivalents, keeping their text. Default: true.
    • remove_nodes — Space-separated list of tags to remove entirely (content included). Default: script style head.
    • strip_placeholder_links — Remove <a> tags without href. Default: true.
    • hard_break — Convert <br> to \n instead of two trailing spaces + \n.
    • suppress_errors — Suppress libxml warnings when loading malformed HTML.
    • list_item_style — List bullet character: -, *, or +.
    • preserve_comments — Keep HTML comments in Markdown output.
    • use_autolinks — Use simple <url> autolink syntax when possible.
    • table_pipe_escape — Replacement string for pipe characters inside table cells.
    • table_caption_side — Show <caption> before (top) or after (bottom) table; empty to suppress.
  • Two-tier configuration — Service-level defaults (in services.yml) are merged with per-instance converter_options passed to createInstance(), so you can set project-wide defaults and override per call.
  • Drush test command — drush dlhtm:test for quickly testing conversions from the command line (inline HTML or a local file, all converter options as flags, output to stdout or a file).
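
The two-tier configuration described above can be pictured as a plain array merge. The following is a minimal sketch, not the module's actual code: merge_converter_options() is a hypothetical helper, and it assumes that per-instance values simply replace service-level values key by key.

```php
<?php

/**
 * Merges service-level defaults with per-instance overrides.
 *
 * Hypothetical helper for illustration only; it is not part of the module.
 */
function merge_converter_options(array $service_defaults, array $instance_options): array {
  // For string keys, later arrays win in array_merge(), so per-instance
  // values replace defaults while untouched defaults are kept.
  return array_merge($service_defaults, $instance_options);
}

// Defaults as listed for the shipped service definition.
$service_defaults = [
  'header_style' => 'atx',
  'strip_tags' => TRUE,
  'remove_nodes' => 'script style head',
  'strip_placeholder_links' => TRUE,
];

// Overrides as they might be passed under 'converter_options'.
$instance_options = [
  'header_style' => 'setext',
  'list_item_style' => '*',
];

$effective = merge_converter_options($service_defaults, $instance_options);
print_r($effective);
```

Under that assumption, the effective options keep strip_tags, remove_nodes, and strip_placeholder_links from the defaults, switch header_style to setext, and gain list_item_style.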

When and why would someone use this module?

  • You use Document Loader and need to convert HTML to Markdown for AI/LLM pipelines, vector stores, or search indexing.
  • You want clean, token-efficient Markdown from HTML sources (web pages, CMS content, email).
  • You are building RAG or retrieval pipelines and need a reusable "HTML in → Markdown out" step.
  • You need to chain with the HTML Processor plugin (or another processor): clean the HTML first, then convert it to Markdown.

Post-Installation

How does this module actually work once I install it?

  1. Use in code — Get the plugin from the Document Loader manager and call load() with an HtmlInput. The result is MarkdownLoaderOutput with getContent() and getMetadata().
  2. Pass configuration per instance — Override converter defaults when creating the plugin by passing converter_options to createInstance():
    $loader = $plugin_manager->createInstance('document_loader_html_to_markdown.html_to_markdown', [
      'converter_options' => [
        'header_style' => 'setext',
        'list_item_style' => '*',
      ],
    ]);
  3. Drush command — Test conversion from the command line:
    drush dlhtm:test --content="<h1>Title</h1><p>Hello <strong>world</strong></p>"
    drush dlhtm:test --file=/tmp/page.html --out=/tmp/page.md
    drush dlhtm:test --content="<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"

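There is no dedicated configuration form and no new content type. Converter options come from the service-level defaults in services.yml and can be overridden per instance via converter_options or, on the command line, via Drush flags.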
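
Step 1 can be sketched as a small wrapper. This is an illustration under stated assumptions, not the module's documented usage: convert_html_input() is a hypothetical helper, and both the Document Loader plugin manager and the HtmlInput object are passed in by the caller, because the manager's service ID and the HtmlInput constructor are not given here. It also assumes that passing an empty converter_options array leaves the service defaults in effect.

```php
<?php

/**
 * Converts an HTML input to Markdown through the Document Loader plugin.
 *
 * Hypothetical helper for illustration only.
 *
 * @param object $plugin_manager
 *   The Document Loader plugin manager (obtained by the caller).
 * @param object $input
 *   An HtmlInput instance (constructed by the caller).
 * @param array $converter_options
 *   Optional per-instance overrides, e.g. ['header_style' => 'setext'].
 */
function convert_html_input(object $plugin_manager, object $input, array $converter_options = []): array {
  $loader = $plugin_manager->createInstance(
    'document_loader_html_to_markdown.html_to_markdown',
    ['converter_options' => $converter_options]
  );

  // load() returns a MarkdownLoaderOutput.
  $output = $loader->load($input);

  return [
    'markdown' => $output->getContent(),
    'metadata' => $output->getMetadata(),
  ];
}
```

Returning both values in one array keeps the Markdown and its metadata together for a later pipeline step such as chunking or indexing.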

Additional Requirements

No external APIs are needed, and nothing has to be downloaded manually: Composer installs the required libraries.

Similar projects

  • Generic Markdown converters — This module is specifically a Document Loader plugin. It does not replace standalone Markdown libraries; it fits into the Document Loader ecosystem (loaders, input/output types, plugin manager).
  • Document Loader — Provides the framework and other loaders (e.g. PDF, text). This project adds one loader that focuses on HTML → Markdown conversion.
  • Document Loader HTML Processor — Focuses on HTML → clean HTML. This project focuses on HTML → Markdown. They complement each other.

Additional information

  • Conversion quality — Powered by league/html-to-markdown, which handles headings, paragraphs, bold, italic, links, images, lists, code blocks, horizontal rules, and blockquotes. Table support is enabled by default via TableConverter.
  • Service defaults — The module ships with sensible defaults (header_style: atx, strip_tags: true, remove_nodes: 'script style head', strip_placeholder_links: true) configured in services.yml. Override per-instance or change the service definition for project-wide defaults.
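
To see what these defaults do at the library level, league/html-to-markdown can be called directly, outside Drupal. The sketch below assumes version 5.x of the library is installed via Composer and that the script sits next to the vendor/ directory. It shows the library's own API, not the module's internal wiring, which may differ.

```php
<?php

// Assumes league/html-to-markdown 5.x is installed via Composer and that
// this script sits next to the vendor/ directory.
require __DIR__ . '/vendor/autoload.php';

use League\HTMLToMarkdown\Converter\TableConverter;
use League\HTMLToMarkdown\HtmlConverter;

// Options mirroring the service defaults listed above.
$converter = new HtmlConverter([
  'header_style' => 'atx',
  'strip_tags' => TRUE,
  'remove_nodes' => 'script style head',
  'strip_placeholder_links' => TRUE,
]);

// The library only converts tables once TableConverter is registered.
$converter->getEnvironment()->addConverter(new TableConverter());

$markdown = $converter->convert('<h1>Title</h1><p>Hello <strong>world</strong></p>');
echo $markdown, PHP_EOL;

$table = $converter->convert('<table><tr><th>A</th></tr><tr><td>1</td></tr></table>');
echo $table, PHP_EOL;
```

The first call should yield an ATX heading followed by a paragraph with bold text; the second prints the pipe-table form of the same table used in the Drush example.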

Activity

  • Total releases: 1
  • First release: Feb 2026
  • Latest release: 1 week ago
  • Stability: 0% stable

Releases

Version      Type   Release date
1.0.x-dev    Dev    Feb 25, 2026