document_loader_html_processor
No security coverage
Document Loader HTML Processor adds a Document Loader plugin that turns raw or fetched HTML into clean, sanitized HTML through configurable processing steps. If you use the Document Loader module to load documents (e.g. for AI, search, or migration), this plugin lets you:
- Extract only the parts you care about (e.g. main, .article-body) using CSS selectors
- Strip unwanted bits (scripts, comments, ads) with regex
- Sanitize HTML with configurable rules (allowed tags, attributes)
- Fetch HTML from HTTP/HTTPS URLs, with SSRF protections and optional resolution of relative links
You pass in HTML (or a URL) and options; you get back sanitized HTML suitable for indexing, chunking, or feeding into other Document Loader plugins. The plugin is aimed at developers and site builders who already use or plan to use Document Loader in their pipeline.
Features
What unique features does enabling this project add?
- Container extraction: Extract one or more regions by CSS selector (e.g. main, .content), in order; multiple selectors are concatenated.
- Regex stripping: Remove patterns (e.g. <script>.…</script>, comments) before sanitization.
- Configurable sanitization: Uses Symfony HTML Sanitizer; you can allow/block/drop elements and attributes (e.g. allow only safe elements, or allow specific attributes like class).
- URL fetching: Fetch HTML from a URL with timeout and redirect limits; only HTTP and HTTPS; localhost and private IPs blocked by default (SSRF protection).
- Relative URL resolution: Optionally resolve relative href/src to absolute URLs using a base URL (from the fetch URL or a configurable base_url).
- Development override: Optional setting in settings.php to allow localhost/private URLs (e.g. for DDEV) in non-production environments.
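The first two features above (regex stripping and container extraction) can be illustrated with a minimal sketch. This is not the module's code: the module works with CSS selectors, whereas this sketch swaps in PHP's built-in DOM extension with XPath queries so that it runs without Composer packages. The function name strip_and_extract(), its parameters, and the order of the two steps are assumptions made for illustration only.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: strips regex patterns, then concatenates the regions
 * matched by each query, in the order the queries are given.
 */
function strip_and_extract(string $html, array $strip_patterns, array $xpath_queries): string {
  // Step 1: remove every configured pattern (scripts, comments, ...).
  foreach ($strip_patterns as $pattern) {
    // preg_replace() returns NULL on a regex error; keep the input then.
    $html = preg_replace($pattern, '', $html) ?? $html;
  }
  if (trim($html) === '') {
    return '';
  }

  // Step 2: parse the remaining markup. Parser warnings (e.g. for HTML5
  // tags) are collected internally instead of being emitted.
  // Character-encoding handling is omitted in this sketch.
  $previous = libxml_use_internal_errors(TRUE);
  $doc = new DOMDocument();
  $doc->loadHTML($html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  // Step 3: collect the outer HTML of every matched region.
  $xpath = new DOMXPath($doc);
  $output = '';
  foreach ($xpath_queries as $query) {
    $nodes = $xpath->query($query);
    if ($nodes === FALSE) {
      // Invalid query: skip it rather than abort the whole run.
      continue;
    }
    foreach ($nodes as $node) {
      $output .= (string) $doc->saveHTML($node);
    }
  }
  return $output;
}
```

Called with the patterns '~<script\b[^>]*>.*?</script>~is' and '~<!--.*?-->~s' and the queries '//main' and '//*[contains(concat(" ", normalize-space(@class), " "), " content ")]' (rough XPath equivalents of the selectors main and .content), the helper returns only the markup of those two regions, with scripts and comments removed.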
When and why would someone use this module?
- You use Document Loader and need to clean and narrow HTML before passing it to another loader or to AI/search.
- You want to crawl or ingest HTML from external URLs and normalize links and structure.
- You need consistent sanitization (allowlist of tags/attributes) for HTML coming from multiple sources.
- You are building RAG, search, or migration pipelines and need a reusable “HTML in → clean HTML out” step.
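Normalizing links, as mentioned in the use cases above, comes down to resolving each relative href/src against a base URL. The following is a simplified, hand-written sketch of that resolution (loosely following RFC 3986, section 5.2), not the module's implementation; a real pipeline would normally delegate this to a URI library. The function names resolve_url() and remove_dot_segments() are hypothetical, and user-info components in the base URL are ignored.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: removes "." and ".." segments from an absolute path.
 */
function remove_dot_segments(string $path): string {
  $segments = explode('/', $path);
  $last = end($segments);
  $out = [];
  foreach ($segments as $segment) {
    if ($segment === '.') {
      continue;
    }
    if ($segment === '..') {
      // Never pop the leading empty segment that stands for the root "/".
      if (count($out) > 1) {
        array_pop($out);
      }
      continue;
    }
    $out[] = $segment;
  }
  // A trailing "." or ".." refers to a directory, so keep a trailing slash.
  if ($last === '.' || $last === '..') {
    $out[] = '';
  }
  return implode('/', $out);
}

/**
 * Hypothetical helper: resolves a reference against an absolute base URL.
 */
function resolve_url(string $base, string $ref): string {
  // Empty references and references with a scheme are returned unchanged.
  if ($ref === '') {
    return $base;
  }
  if (preg_match('~^[a-z][a-z0-9+.\-]*:~i', $ref) === 1) {
    return $ref;
  }
  $b = parse_url($base);
  if ($b === FALSE || !isset($b['scheme'], $b['host'])) {
    throw new InvalidArgumentException('Base URL must be absolute.');
  }
  $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
  $base_path = $b['path'] ?? '/';

  // Protocol-relative reference: reuse only the base scheme.
  if (str_starts_with($ref, '//')) {
    return $b['scheme'] . ':' . $ref;
  }
  // Fragment-only and query-only references keep the base path.
  if ($ref[0] === '#') {
    return $origin . $base_path . (isset($b['query']) ? '?' . $b['query'] : '') . $ref;
  }
  if ($ref[0] === '?') {
    return $origin . $base_path . $ref;
  }
  // Split the reference into its path and its "?query#fragment" suffix.
  $cut = strcspn($ref, '?#');
  $ref_path = substr($ref, 0, $cut);
  $suffix = substr($ref, $cut);
  if ($ref_path[0] !== '/') {
    // Path-relative: merge with the directory part of the base path.
    $ref_path = substr($base_path, 0, strrpos($base_path, '/') + 1) . $ref_path;
  }
  return $origin . remove_dot_segments($ref_path) . $suffix;
}
```

With the base https://example.com/docs/guide/page.html, the reference ../img/a.png resolves to https://example.com/docs/img/a.png, while absolute URLs and mailto: links pass through untouched.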
Post-Installation
How does this module actually work once I install it?
- Use in code: Get the plugin from the Document Loader manager and call load() with an HtmlProcessorInput that has content or url and optional config (container, strip_regex, base_url, convert_relative_url, sanitizer). The result is HtmlOutput with getContent() and getMetadata().
- Optional (allow local/private URLs in development): Only if you need to fetch from localhost or private IPs (e.g. DDEV), add to settings.php: $settings['document_loader_html_processor_allow_private_network'] = TRUE; Do not enable this in production.
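One way to keep the development override out of production is to set it only when the site runs inside DDEV. The fragment below is a sketch for settings.php; it assumes DDEV's IS_DDEV_PROJECT environment variable is how your setup identifies a local environment. Other hosting setups would need a different condition.

```php
<?php

// settings.php fragment: enable the override only inside a DDEV container.
// Assumption: DDEV sets IS_DDEV_PROJECT=true in its web container; on any
// other environment the setting stays unset and private networks stay blocked.
if (getenv('IS_DDEV_PROJECT') === 'true') {
  $settings['document_loader_html_processor_allow_private_network'] = TRUE;
}
```

Placing the setting in an environment-specific file such as settings.local.php, which is not deployed, would achieve the same goal.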
Additional Requirements
- Drupal — Core ^10.4 || ^11.
- Document Loader — document_loader (^1.0). Required; this module is a plugin for it.
- Composer dependencies (installed with the module): league/uri (^7.0); Symfony components: css-selector, dom-crawler, html-sanitizer, http-client, http-foundation (^6.4 || ^7.0).
Similar projects
- Generic HTML parsers or scrapers — This module is specifically a Document Loader plugin. It does not replace Feeds, Migrate, or custom scrapers; it fits into the Document Loader ecosystem (loaders, input/output types).
- Document Loader — Provides the framework and other loaders (e.g. PDF, text). This project adds one loader that focuses on HTML processing (extract, strip, sanitize, fetch, resolve links).
Additional information
- Security — URL fetching is restricted to HTTP/HTTPS; file:// and localhost are blocked by default. Private and loopback IPs are blocked unless the development setting above is set. Non-2xx responses are treated as failures.
- Refer to https://symfony.com/doc/current/html_sanitizer.html for HTML Sanitizer options.
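The URL restrictions described under Security can be pictured as a pre-flight check that runs before any request is sent. The sketch below is an illustration in plain PHP, not the module's actual implementation; the function names is_public_ip() and is_fetchable_url() and the $allow_private_network parameter are hypothetical stand-ins for the behavior described above.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: TRUE if an IP literal is neither private nor reserved.
 */
function is_public_ip(string $ip): bool {
  // The flags reject private ranges (e.g. 10.0.0.0/8, 192.168.0.0/16) and
  // reserved ranges such as loopback (127.0.0.0/8) and link-local.
  return filter_var(
    $ip,
    FILTER_VALIDATE_IP,
    FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
  ) !== FALSE;
}

/**
 * Hypothetical pre-flight check for a URL that is about to be fetched.
 */
function is_fetchable_url(string $url, bool $allow_private_network = FALSE): bool {
  $parts = parse_url($url);
  if ($parts === FALSE || !isset($parts['scheme'], $parts['host'])) {
    // Covers malformed URLs and host-less schemes such as file://.
    return FALSE;
  }
  if (!in_array(strtolower($parts['scheme']), ['http', 'https'], TRUE)) {
    return FALSE;
  }
  if ($allow_private_network) {
    // Development override: skip the network-range checks entirely.
    return TRUE;
  }
  // parse_url() keeps the brackets around IPv6 literals, so strip them.
  $host = strtolower(trim($parts['host'], '[]'));
  if ($host === 'localhost' || str_ends_with($host, '.localhost')) {
    return FALSE;
  }
  if (filter_var($host, FILTER_VALIDATE_IP) !== FALSE) {
    return is_public_ip($host);
  }
  // Hostname: every resolved address must be public. Note that
  // gethostbynamel() only returns IPv4 addresses.
  $ips = gethostbynamel($host);
  if ($ips === FALSE) {
    return FALSE;
  }
  foreach ($ips as $ip) {
    if (!is_public_ip($ip)) {
      return FALSE;
    }
  }
  return TRUE;
}
```

A check like this would also have to be repeated for every redirect target, since a public URL can redirect to a private address.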