
An AI Automators plugin that splits a text_long HTML field into JSON-encoded chunks for downstream AI processing.

Large documents often exceed what an AI model can accept in a single request. This module breaks the HTML into manageable pieces using a configurable strategy, stores them as a JSON array in a target field, and lets the AI process each chunk independently before the results are reassembled.

Features

  • Four chunking algorithms to match different document structures (see below)
  • Configurable maximum and minimum chunk sizes (fragments shorter than the minimum are discarded automatically)
  • Uses PHP DOMDocument for structurally aware splitting — output is always valid HTML
  • Full UTF-8 safety: wraps input in a charset meta tag before parsing, preventing DOMDocument's default Latin-1 assumption from corrupting em-dashes, smart quotes, and other multibyte characters
  • Output is a single JSON-encoded array stored in the target field, ready for queue-based AI processing
  • Integrates with the AI Document Proofreader workflow: sets field_ai_doc_proofread_status to chunking after storing chunks, which automatically triggers the proofreading queue
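
The minimum-size filter and the JSON output described in the list above can be sketched as follows. This is an illustration in Python, not the module's PHP code: the function name `encode_chunks`, the `min_chars` parameter, and measuring size as the character count of the trimmed markup are all assumptions.

```python
import json

def encode_chunks(chunks, min_chars=50):
    """Drop fragments shorter than min_chars, then JSON-encode the rest.

    Assumption: size is measured in characters of the trimmed HTML string.
    """
    kept = [c for c in chunks if len(c.strip()) >= min_chars]
    # ensure_ascii=False stores multibyte characters as-is instead of \u escapes.
    return json.dumps(kept, ensure_ascii=False)

chunks = [
    "<h2>Intro</h2><p>Caf\u00e9 menus, \u201csmart\u201d quotes and all.</p>",
    "<p>Hi</p>",  # too short, will be discarded
]
stored = encode_chunks(chunks, min_chars=20)
print(stored)
```

A consumer such as a queue worker would call `json.loads(stored)` to get the list of HTML chunks back.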

Chunking algorithms

By heading (recommended)

Splits at every <h1> to <h4> boundary. Each heading and the content that follows it becomes one chunk. Best for structured documents such as reports, manuals, and articles.
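
A rough Python sketch of the idea, for illustration only. The module itself parses with PHP DOMDocument; this version uses a regular expression and therefore assumes headings are not nested inside other elements. Treating content before the first heading as its own chunk is also an assumption.

```python
import re

# Zero-width lookahead: split *before* each <h1>..<h4> opening tag.
HEADING_START = re.compile(r"(?=<h[1-4][\s>])", re.IGNORECASE)

def split_by_heading(html_text):
    """Return one chunk per heading, each holding the heading and what follows it."""
    parts = HEADING_START.split(html_text)
    return [p.strip() for p in parts if p.strip()]

doc = (
    "<p>Preface.</p>"
    "<h1>Title</h1><p>Intro.</p>"
    "<h2>Part A</h2><p>Body.</p><h5>Note</h5><p>Aside.</p>"
)
chunks = split_by_heading(doc)
for chunk in chunks:
    print(chunk)
```

Note that the `<h5>` stays inside the "Part A" chunk, because only levels 1 through 4 act as boundaries.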

By paragraph

Splits at every top-level block element. Each paragraph, list, table, or other block becomes its own chunk. Produces many small chunks; suited to documents with very long sections.
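
To make "top-level block element" concrete, here is a minimal Python sketch that tracks nesting depth with the standard library's `HTMLParser`. It is not the module's DOMDocument code. It assumes a well-formed body fragment (every non-void element explicitly closed), drops comments, and the class and function names are hypothetical.

```python
import html
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag

class TopLevelSplitter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # current nesting depth
        self.buf = []       # pieces of the block being collected
        self.blocks = []    # finished top-level blocks

    def flush(self):
        block = "".join(self.buf).strip()
        if block:
            self.blocks.append(block)
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.flush()  # any stray top-level text ends here
        self.buf.append(self.get_starttag_text())
        if tag in VOID:
            if self.depth == 0:
                self.flush()  # a top-level <hr> is a block on its own
        else:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: never changes the depth.
        if self.depth == 0:
            self.flush()
        self.buf.append(self.get_starttag_text())
        if self.depth == 0:
            self.flush()

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        self.buf.append(f"</{tag}>")
        self.depth = max(0, self.depth - 1)
        if self.depth == 0:
            self.flush()  # a top-level element just closed

    def handle_data(self, data):
        # The parser hands us unescaped text, so re-escape it on output.
        self.buf.append(html.escape(data, quote=False))

def split_by_paragraph(html_text):
    parser = TopLevelSplitter()
    parser.feed(html_text)
    parser.close()
    parser.flush()
    return parser.blocks

doc = "<p>One &amp; two.</p>\n<ul><li>A</li><li>B</li></ul><hr><table><tr><td>x</td></tr></table>"
blocks = split_by_paragraph(doc)
```

Nested elements such as the `<li>` items never start a new chunk, because they are opened at depth 1 or deeper.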

By size

Splits mechanically at the configured character limit, always breaking at a tag boundary so the output remains valid HTML. Useful as a fallback for unstructured documents.
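
One way to break "at a tag boundary" is to pack whole top-level blocks greedily until the next one would exceed the limit. The Python sketch below shows that approach under stated assumptions: the input is already a list of complete HTML blocks, size is counted in characters, and a single block larger than the limit is kept whole rather than cut. The module's actual handling of oversized blocks is not documented here.

```python
def pack_by_size(blocks, max_chars):
    """Greedily join whole blocks into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for block in blocks:
        # Start a new chunk if adding this block would cross the limit.
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)
            current = ""
        current += block
    if current:
        chunks.append(current)
    return chunks

blocks = ["<p>aaaa</p>", "<p>bbbb</p>", "<p>cccc</p>"]  # 11 characters each
chunks = pack_by_size(blocks, max_chars=25)
```

Because cuts only happen between blocks, every chunk is a sequence of complete elements.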

By heading, then size

Splits at headings first, then further splits any section that exceeds the maximum chunk size using the by-size algorithm. Combines structural awareness with a hard size cap — good for documents with a mix of short and very long sections.
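
The two-stage behavior can be sketched in a single pass over top-level blocks: a new chunk starts at every heading, and also whenever the current section would grow past the limit. This Python version is an illustration under assumptions, not the module's code: blocks are complete, trimmed HTML strings, size is counted in characters, and continuation chunks of a long section do not repeat the heading.

```python
import re

IS_HEADING = re.compile(r"<h[1-4][\s>]", re.IGNORECASE)

def split_heading_then_size(blocks, max_chars):
    """Start a chunk at each heading; split sections that exceed max_chars."""
    chunks, current = [], ""
    for block in blocks:
        starts_section = IS_HEADING.match(block) is not None
        too_big = len(current) + len(block) > max_chars
        if current and (starts_section or too_big):
            chunks.append(current)
            current = ""
        current += block
    if current:
        chunks.append(current)
    return chunks

blocks = [
    "<h2>A</h2>", "<p>short</p>",
    "<h2>B</h2>", "<p>" + "x" * 30 + "</p>", "<p>" + "y" * 30 + "</p>",
]
chunks = split_heading_then_size(blocks, max_chars=50)
```

Section A fits within the limit and stays whole; section B is split into two chunks because its three blocks total 84 characters.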

Requirements

  • Drupal 10.4, 11, or 12
  • AI module with the AI Automators sub-module enabled
  • PHP DOM and libxml extensions (enabled by default in most PHP distributions)

Installation

drush en doc_html_chunker

The module itself needs no configuration after installation. To use it, set up the automator on any text_long field through the standard AI Automators field settings UI.

Related modules

  • AI — required base module and automator framework
  • AI Automator: Pandoc — converts Word/PDF files to HTML to feed into this chunker
  • AI Document Proofreader — full node-based AI proofreading workflow that uses this module for chunk preparation

Releases

Version      Type   Release date
1.0.x-dev    Dev    Mar 9, 2026