document_loader_html_processor

Document Loader HTML Processor adds a Document Loader plugin that turns raw or fetched HTML into clean, sanitized HTML according to rules you configure. If you use the Document Loader module to load documents (e.g. for AI, search, or migration), this plugin lets you:

Extract only the parts you care about (e.g. main, .article-body) using CSS selectors
Strip unwanted bits (scripts, comments, ads) with regex
Sanitize HTML with configurable rules (allowed tags, attributes)
Fetch HTML from HTTP/HTTPS URLs, with SSRF protections and optional resolution of relative links

You pass in HTML (or a URL) and options; you get back sanitized HTML suitable for indexing, chunking, or feeding into other Document Loader plugins. The plugin is aimed at developers and site builders who already use or plan to use Document Loader in their pipeline.

Features

What unique features does enabling this project add?

Container extraction — Extract one or more regions by CSS selector (e.g. main, .content), in order; multiple selectors are concatenated.
Regex stripping — Remove patterns (e.g. <script>.…</script>, comments) before sanitization.
Configurable sanitization — Uses Symfony HTML Sanitizer; you can allow/block/drop elements and attributes (e.g. allow only safe elements, or allow specific attributes like class).
URL fetching — Fetch HTML from a URL with timeout and redirect limits; only HTTP and HTTPS; localhost and private IPs blocked by default (SSRF protection).
Relative URL resolution — Optionally resolve relative href/src to absolute URLs using a base URL (from the fetch URL or a configurable base_url).
Development override — Optional setting in settings.php to allow localhost/private URLs (e.g. for DDEV) in non-production environments.
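The strip, extract, and sanitize steps listed above can be sketched with the same Symfony components the module depends on. This is an illustration only, not the module's actual implementation: the function names (strip_patterns, extract_containers, sanitize_html, clean_html), the order of the steps, and the chosen allowlist are all assumptions made for the example. It also assumes Composer's autoloader is already loaded, as it is inside Drupal.

```php
<?php

declare(strict_types=1);

use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\HtmlSanitizer\HtmlSanitizer;
use Symfony\Component\HtmlSanitizer\HtmlSanitizerConfig;

/**
 * Removes every match of each PCRE pattern (hypothetical helper).
 *
 * @param string[] $patterns Patterns including delimiters and flags.
 */
function strip_patterns(string $html, array $patterns): string {
  foreach ($patterns as $pattern) {
    // preg_replace() returns NULL on error; keep the input in that case.
    $html = preg_replace($pattern, '', $html) ?? $html;
  }
  return $html;
}

/**
 * Concatenates the outer HTML of all matches, selector by selector.
 *
 * @param string[] $selectors CSS selectors, applied in the given order.
 */
function extract_containers(string $html, array $selectors): string {
  if ($selectors === []) {
    return $html;
  }
  $crawler = new Crawler($html);
  $parts = [];
  foreach ($selectors as $selector) {
    $matches = $crawler->filter($selector)
      ->each(static fn (Crawler $node): string => $node->outerHtml());
    $parts = array_merge($parts, $matches);
  }
  return implode("\n", $parts);
}

/**
 * Sanitizes with an example allowlist (the allowlist is an assumption).
 */
function sanitize_html(string $html): string {
  $config = (new HtmlSanitizerConfig())
    ->allowElement('h1')
    ->allowElement('p', ['class'])
    ->allowElement('a', ['href'])
    // Blocked elements are removed but their children are kept.
    ->blockElement('main')
    // Dropped elements are removed together with their children.
    ->dropElement('iframe');

  return (new HtmlSanitizer($config))->sanitize($html);
}

/**
 * Runs strip, then extract, then sanitize (order assumed for this sketch).
 */
function clean_html(string $html, array $selectors, array $patterns): string {
  return sanitize_html(extract_containers(strip_patterns($html, $patterns), $selectors));
}
```

Calling clean_html($html, ['main'], ['#<script\b[^>]*>.*?</script>#is']) on a page would keep only the contents of the main element, without scripts and without attributes outside the allowlist, such as onclick.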

When and why would someone use this module?

You use Document Loader and need to clean and narrow HTML before passing it to another loader or to AI/search.
You want to crawl or ingest HTML from external URLs and normalize links and structure.
You need consistent sanitization (allowlist of tags/attributes) for HTML coming from multiple sources.
You are building RAG, search, or migration pipelines and need a reusable “HTML in → clean HTML out” step.

Post-Installation

How does this module actually work once I install it?

Use in code — Get the plugin from the Document Loader manager and call load() with an HtmlProcessorInput that has content or url and optional config (container, strip_regex, base_url, convert_relative_url, sanitizer). The result is HtmlOutput with getContent() and getMetadata().
Optional: allow local/private URLs in development — Only if you need to fetch from localhost or private IPs (e.g. DDEV), add to settings.php: $settings['document_loader_html_processor_allow_private_network'] = TRUE; Do not enable this in production.
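One way to keep the development override out of production is to guard it with an environment check. The fragment below is a sketch for settings.php or settings.local.php. The setting name comes from the text above; the IS_DDEV_PROJECT check assumes DDEV exports that variable in its web container, so adapt the condition to however your project detects a local environment.

```php
// In settings.php or settings.local.php.
// Assumption: DDEV sets IS_DDEV_PROJECT=true inside its web container.
if (getenv('IS_DDEV_PROJECT') === 'true') {
  // Allow fetching from localhost and private IPs in local development only.
  $settings['document_loader_html_processor_allow_private_network'] = TRUE;
}
```

With this guard the setting stays unset anywhere the variable is missing, so the default SSRF protections remain active.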

Additional Requirements

Drupal — Core ^10.4 || ^11.
Document Loader — document_loader (^1.0). Required; this module is a plugin for it.
Composer dependencies (installed with the module): league/uri (^7.0); Symfony components: css-selector, dom-crawler, html-sanitizer, http-client, http-foundation (^6.4 || ^7.0).

Similar projects

Generic HTML parsers or scrapers — This module is specifically a Document Loader plugin. It does not replace Feeds, Migrate, or custom scrapers; it fits into the Document Loader ecosystem (loaders, input/output types).
Document Loader — Provides the framework and other loaders (e.g. PDF, text). This project adds one loader that focuses on HTML processing (extract, strip, sanitize, fetch, resolve links).

Additional information

Security — URL fetching is restricted to HTTP/HTTPS; file:// and localhost are blocked by default. Private and loopback IPs are blocked unless the development setting above is set. Non-2xx responses are treated as failures.
HTML Sanitizer options: for the available allow/block/drop rules, see the Symfony HTML Sanitizer documentation at https://symfony.com/doc/current/html_sanitizer.html.
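The URL restrictions described under Security can be illustrated with a small pre-flight check in plain PHP. This is a minimal sketch of the general technique, not the module's code: the function name is_fetchable_url and its $allowPrivate parameter are hypothetical, and it only resolves IPv4 addresses for hostnames.

```php
<?php

declare(strict_types=1);

/**
 * Returns TRUE if a URL passes a basic SSRF pre-check (hypothetical helper).
 *
 * $allowPrivate mirrors the idea of the development override.
 */
function is_fetchable_url(string $url, bool $allowPrivate = FALSE): bool {
  $parts = parse_url($url);
  if ($parts === FALSE || !isset($parts['scheme'], $parts['host'])) {
    // Malformed URL, or no host at all (e.g. file:///etc/passwd).
    return FALSE;
  }
  if (!in_array(strtolower($parts['scheme']), ['http', 'https'], TRUE)) {
    return FALSE;
  }
  if ($allowPrivate) {
    return TRUE;
  }

  // parse_url() keeps the brackets around IPv6 literals; remove them.
  $host = strtolower(trim($parts['host'], '[]'));
  if ($host === 'localhost' || str_ends_with($host, '.localhost')) {
    return FALSE;
  }

  // A literal IP is checked directly; a hostname is resolved first.
  $ips = filter_var($host, FILTER_VALIDATE_IP) !== FALSE
    ? [$host]
    : (gethostbynamel($host) ?: []);
  if ($ips === []) {
    return FALSE;
  }

  foreach ($ips as $ip) {
    // Reject private ranges (e.g. 10.0.0.0/8) and reserved ranges
    // (e.g. 127.0.0.0/8, ::1).
    $flags = FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE;
    if (filter_var($ip, FILTER_VALIDATE_IP, $flags) === FALSE) {
      return FALSE;
    }
  }
  return TRUE;
}
```

A check like this is only a first line of defense. The address can change between the check and the actual request (DNS rebinding), and each redirect target needs the same validation, so a real fetcher should also enforce the rules at connection time.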

Version	Type	Release date
1.0.0-rc1	Pre-release	Feb 24, 2026
1.0.x-dev	Dev	Feb 24, 2026
