Clean up messy HTML from any source — extract the content you want, strip ads and noise, fix links, and sanitize — before you store, convert, or index it.

Feed a Markdown converter or search index a raw web page and you get navigation, ads, and boilerplate in the result. HTML Processor removes that noise first, so downstream output stays clean and token-efficient. It runs a configurable pipeline from a single service call or a saved default, and is standalone: no other Drupal modules required, just a few Symfony/League libraries that Composer installs for you.

What it does

  • Extracts the content you want with CSS selectors (article, #main-content) and drops the rest.
  • Removes ads and boilerplate — built-in patterns for common networks, plus your own.
  • Strips unwanted fragments with admin-trusted regex (guarded against ReDoS).
  • Rewrites relative links and images to absolute URLs so they keep working.
  • Sanitizes elements and attributes via the Symfony HTML Sanitizer.
  • Shapes output — wrap as a full document, or minify.
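
To make the link-rewriting step concrete, here is a minimal sketch of resolving a relative URL against the page it came from. It is not the module's implementation: `resolve_url()` is a hypothetical helper written for this example, and it deliberately skips `../` normalization.

```php
<?php

declare(strict_types=1);

/**
 * Resolves a link against the URL of the page it came from.
 *
 * Hypothetical helper for illustration; "../" and "./" segments are not
 * normalized.
 */
function resolve_url(string $url, string $base): string {
  // Leave empty values, fragment-only links, and anything that already has a
  // scheme (https:, mailto:, data:, ...) untouched.
  if ($url === '' || $url[0] === '#' || preg_match('~^[a-z][a-z0-9+.-]*:~i', $url) === 1) {
    return $url;
  }

  $parts = parse_url($base);
  if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
    // Without an absolute base there is nothing to resolve against.
    return $url;
  }

  $origin = $parts['scheme'] . '://' . $parts['host'];
  if (isset($parts['port'])) {
    $origin .= ':' . $parts['port'];
  }
  $path = $parts['path'] ?? '/';

  if (str_starts_with($url, '//')) {
    // Protocol-relative: reuse the scheme of the base URL.
    return $parts['scheme'] . ':' . $url;
  }
  if ($url[0] === '/') {
    // Root-relative: keep only the origin of the base URL.
    return $origin . $url;
  }
  if ($url[0] === '?') {
    // Query-only: keep the full base path.
    return $origin . $path . $url;
  }

  // Path-relative: append to the directory part of the base path.
  $directory = substr($path, 0, (int) strrpos($path, '/') + 1);
  return $origin . $directory . $url;
}

$base = 'https://example.com/blog/post.html';
foreach (['/about', 'img/logo.png', '//cdn.example.net/app.js', '#top'] as $link) {
  echo $link, ' => ', resolve_url($link, $base), PHP_EOL;
}
```

In a real pipeline the same function would be applied to each href and src attribute found in the parsed document.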

Pass options per call, or save a default pipeline in the admin form and have it applied automatically.

Use cases

Cleaning HTML before Markdown conversion, AI/RAG ingestion, migrations, or search indexing — anywhere you pull content from sources you don't control.

Requirements

Drupal core ^10.4 || ^11. A few small Symfony and League Composer libraries are installed automatically — no other Drupal modules required. (Exact packages are listed in the README.)

Recommended modules

Optional integrations — HTML Processor works on any HTML string on its own:

  • An HTML-to-Markdown loader — convert the cleaned HTML to Markdown for documentation or AI/RAG pipelines.
  • URL Fetcher (url_fetcher) — retrieve remote HTML to feed into the processor.

Install and configure

composer require drupal/html_processor
drush en html_processor -y

Set defaults at Configuration › Content authoring › HTML Processor (permission: Administer HTML Processor settings). Defaults are opt-in, and explicit options passed in code always win.
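
The precedence rule can be pictured as a simple array merge. The sketch below is an assumption about the behaviour described above, not the module's code: `effective_options()` and the `$defaultsEnabled` flag are hypothetical names, and only the option keys `container` and `remove_ads` come from this page.

```php
<?php

declare(strict_types=1);

/**
 * Builds the effective option set for one call.
 *
 * Hypothetical helper for illustration.
 *
 * @param array $explicit        Options passed by the caller.
 * @param array $saved           Defaults saved in the admin form.
 * @param bool  $defaultsEnabled Whether the saved defaults are opted in.
 */
function effective_options(array $explicit, array $saved, bool $defaultsEnabled): array {
  // Defaults are opt-in: ignore them entirely unless enabled.
  $base = $defaultsEnabled ? $saved : [];

  // array_replace() lets later arrays override earlier ones key by key,
  // so explicit options always win over saved defaults.
  return array_replace($base, $explicit);
}

$saved = ['container' => 'main', 'remove_ads' => TRUE];
$explicit = ['container' => 'article, #main-content'];

print_r(effective_options($explicit, $saved, TRUE));
print_r(effective_options($explicit, $saved, FALSE));
```

With defaults enabled, the caller's `container` replaces the saved one while the saved `remove_ads` still applies; with defaults disabled, only the explicit options remain.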

In code, inject HtmlProcessorInterface and call process():

$clean = $this->htmlProcessor->process([
  'content'    => $rawHtml,
  'container'  => 'article, #main-content',
  'remove_ads' => TRUE,
]);

The full service API, the Drush command, autowiring setup, and security notes are in the module's README.md.
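
As a rough illustration of the injection step, the sketch below wires a processor into a consumer class through its constructor. To keep it runnable outside Drupal, it declares a stand-in `HtmlProcessorInterface` locally. The real interface ships with the module, and its namespace and exact `process()` signature (including the string return type) are assumed here. `ArticleCleaner` and `RecordingProcessor` are hypothetical names.

```php
<?php

declare(strict_types=1);

// Stand-in so the sketch runs outside Drupal. The real interface ships with
// the module; its namespace and return type are assumptions here.
interface HtmlProcessorInterface {
  public function process(array $options): string;
}

// Hypothetical consumer service that receives the processor by injection.
final class ArticleCleaner {

  public function __construct(
    private readonly HtmlProcessorInterface $htmlProcessor,
  ) {}

  public function clean(string $rawHtml): string {
    return $this->htmlProcessor->process([
      'content' => $rawHtml,
      'container' => 'article, #main-content',
      'remove_ads' => TRUE,
    ]);
  }

}

// Fake processor for the demo: records the options it receives and returns
// the content unchanged.
final class RecordingProcessor implements HtmlProcessorInterface {

  public array $received = [];

  public function process(array $options): string {
    $this->received = $options;
    return $options['content'];
  }

}

$fake = new RecordingProcessor();
$cleaner = new ArticleCleaner($fake);
echo $cleaner->clean('<article><p>Hello</p></article>'), PHP_EOL;
```

Depending on the interface rather than a concrete class is what makes the fake possible, which is handy in unit tests.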

Good to know

  • Container extraction uses explicit CSS selectors — precise, but review them if a source site changes its markup.
  • Headed for Markdown? Turn minify off (it breaks code blocks) and keep sanitization light to preserve structure.
  • Regex stripping and ad patterns are admin-trusted only: never accept them from anonymous or otherwise untrusted input.
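
To see why minification and code blocks do not mix, consider a naive minifier that collapses every run of whitespace. This is a stand-in written for the example, and the module's own minifier may behave differently. `naive_minify()` is a hypothetical name.

```php
<?php

declare(strict_types=1);

// Naive whitespace-collapsing minifier. A stand-in for illustration only,
// not the module's own minify step.
function naive_minify(string $html): string {
  return trim((string) preg_replace('/\s+/', ' ', $html));
}

// A preformatted code sample whose line breaks and indentation matter.
$html = "<pre><code>if (\$ready) {\n    run();\n}</code></pre>";
$minified = naive_minify($html);

// The three-line snippet is now a single line; a Markdown converter would
// emit a one-line code block.
echo $minified, PHP_EOL;
```

Once the line breaks inside the pre element are gone, a downstream converter has no way to restore them.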

Similar projects

HTML Processor is the standalone successor to the cleaning services in Document Loader: HTML Processor (document_loader_html_processor). It pairs naturally with an HTML-to-Markdown loader downstream.

Activity

Total releases: 1
First release: Jun 2026
Latest release: Jun 29, 2026
Stability: 0% stable

Releases

Version      Type          Release date
1.0.0-rc1    Pre-release   Jun 29, 2026