
An AI Automators plugin that splits a text_long HTML field into JSON-encoded chunks for downstream AI processing.

Large documents often exceed what an AI model can accept in a single request. This module breaks the HTML into manageable pieces using a configurable strategy and stores them as a JSON array in a target field. The AI then processes each chunk independently, and the results are reassembled afterwards.
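The store, process, and reassemble flow described above can be sketched in a few lines. This is an illustrative Python sketch, not the module's PHP code: the function names are hypothetical, the chunks are assumed to be stored as plain HTML strings inside the JSON array, and fake_proofread is a stand-in for the real AI call.

```python
import json

def store_chunks(chunks):
    """Encode a list of HTML chunk strings as one JSON array string."""
    # ensure_ascii=False keeps characters such as smart quotes readable.
    return json.dumps(chunks, ensure_ascii=False)

def process_and_reassemble(stored, process_chunk):
    """Decode the stored array, process each chunk alone, rejoin in order."""
    chunks = json.loads(stored)
    return "".join(process_chunk(chunk) for chunk in chunks)

# Hypothetical stand-in for the AI step: it only fixes one typo.
def fake_proofread(chunk):
    return chunk.replace("teh", "the")

stored = store_chunks([
    "<h2>Intro</h2><p>teh first part</p>",
    "<p>teh second part</p>",
])
result = process_and_reassemble(stored, fake_proofread)
```

Because each chunk is processed without seeing its neighbours, the order of the array is the only thing that ties the results back together.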

Features

  • Four chunking algorithms to match different document structures (see below)
  • Configurable maximum and minimum chunk sizes (fragments shorter than the minimum are discarded automatically)
  • Uses PHP DOMDocument for structurally aware splitting — output is always valid HTML
  • Full UTF-8 safety: wraps input in a charset meta tag before parsing, preventing DOMDocument's default Latin-1 assumption from corrupting em-dashes, smart quotes, and other multibyte characters
  • Output is a single JSON-encoded array stored in the target field, ready for queue-based AI processing
  • Integrates with the AI Document Proofreader workflow: sets field_ai_doc_proofread_status to chunking after storing chunks, which automatically triggers the proofreading queue
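The UTF-8 point in the list above is easiest to see by reproducing the failure. The Python sketch below is an illustration only, not the module's PHP code: it shows what happens when UTF-8 bytes are read as Latin-1, and a hypothetical wrap_with_charset helper shows the general idea of declaring the charset before the content. The exact wrapper markup the module uses is an assumption.

```python
def wrap_with_charset(fragment):
    """Wrap an HTML fragment in a document that declares UTF-8 up front."""
    # Assumed wrapper markup; the module's actual wrapper may differ.
    return (
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"></head>'
        f"<body>{fragment}</body></html>"
    )

# An em-dash and smart quotes, written as escapes for clarity.
text = "Chunking \u2014 \u201csafely\u201d"
raw = text.encode("utf-8")

# Wrong: a Latin-1 reader turns every byte into its own character,
# so each multibyte character becomes a run of junk characters.
garbled = raw.decode("latin-1")

# Right: a reader that knows the bytes are UTF-8 recovers the text.
restored = raw.decode("utf-8")

wrapped = wrap_with_charset("<p>" + text + "</p>")
```

The declaration has to come before the content it describes, which is why the fragment is wrapped rather than having the tag appended.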

Chunking algorithms

By heading (recommended)

Splits at every <h1> to <h4> boundary. Each heading and the content that follows it becomes one chunk. Best for structured documents such as reports, manuals, and articles.
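As a rough illustration of heading-based splitting, the Python sketch below cuts the markup just before every opening <h1> through <h4> tag. It is a simplification and not the module's code: the module works on a parsed DOM, whereas this sketch uses a regular expression and assumes the headings are not nested inside other elements. The name split_by_heading is hypothetical.

```python
import re

# Zero-width match just before an opening <h1> to <h4> tag.
HEADING_START = re.compile(r"(?=<h[1-4][\s>])", re.IGNORECASE)

def split_by_heading(html):
    """Start a new chunk at every <h1>-<h4>; drop empty pieces."""
    parts = HEADING_START.split(html)
    return [part.strip() for part in parts if part.strip()]

sections = split_by_heading(
    "<p>Intro</p><h1>A</h1><p>a</p><h2>B</h2><p>b</p>"
)
```

Content that appears before the first heading becomes a chunk of its own, and tags such as <hr> or <h5> do not trigger a split.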

By paragraph

Splits at every top-level block element. Each paragraph, list, table, or other block becomes its own chunk. Produces many small chunks; suited to documents with very long sections.
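Splitting at top-level blocks means tracking nesting depth and cutting only when the depth returns to zero. The Python sketch below illustrates that idea with the standard library's HTMLParser; it is not the module's DOMDocument-based code, the class and function names are hypothetical, and it assumes reasonably well-formed input (comments are dropped).

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class TopLevelSplitter(HTMLParser):
    """Collect the raw markup of each top-level element as one chunk."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.chunks = []  # finished top-level elements
        self.buffer = []  # markup of the element being read
        self.depth = 0    # number of currently open elements

    def flush(self):
        chunk = "".join(self.buffer).strip()
        if chunk:
            self.chunks.append(chunk)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        self.buffer.append(self.get_starttag_text())
        if tag in VOID_TAGS:
            if self.depth == 0:
                self.flush()
        else:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <br/>: never changes the depth.
        self.buffer.append(self.get_starttag_text())
        if self.depth == 0:
            self.flush()

    def handle_endtag(self, tag):
        self.buffer.append(f"</{tag}>")
        self.depth = max(0, self.depth - 1)
        if self.depth == 0:
            self.flush()

    def handle_data(self, data):
        self.buffer.append(data)

    def handle_entityref(self, name):
        self.buffer.append(f"&{name};")

    def handle_charref(self, name):
        self.buffer.append(f"&#{name};")

def split_top_level(html):
    parser = TopLevelSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing top-level text
    return parser.chunks

blocks = split_top_level(
    "<p>One &amp; two</p>\n<ul><li>a</li><li>b</li></ul>"
    "<table><tr><td>x</td></tr></table>"
)
```

Nested elements such as the list items stay inside their parent's chunk, which is what keeps every chunk a complete block.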

By size

Splits mechanically at the configured character limit, always breaking at a tag boundary so the output remains valid HTML. Useful as a fallback for unstructured documents.
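One way to read "breaking at a tag boundary" is greedy packing: whole blocks are appended to the current chunk until the next one would push it past the limit. The Python sketch below illustrates that reading and is not the module's implementation. The name pack_by_size is hypothetical, the input is assumed to be a list of complete HTML blocks, and keeping an oversized single block whole is an assumption.

```python
def pack_by_size(blocks, max_chars):
    """Greedily join whole blocks into chunks of at most max_chars."""
    chunks = []
    current = ""
    for block in blocks:
        # Close the current chunk if this block would exceed the limit.
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)
            current = ""
        # A block is never cut, so one oversized block stays whole.
        current += block
    if current:
        chunks.append(current)
    return chunks

packed = pack_by_size(["<p>aaaa</p>", "<p>bbbb</p>", "<p>cccc</p>"], 25)
```

Since cuts only ever fall between complete blocks, every chunk is valid HTML whenever the input blocks are.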

By heading, then size

Splits at headings first, then further splits any section that exceeds the maximum chunk size using the by-size algorithm. Combines structural awareness with a hard size cap — good for documents with a mix of short and very long sections.
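The combined strategy can be sketched as size-based packing that never crosses a heading boundary. The Python sketch below is illustrative only, not the module's code. It assumes the document has already been divided into sections, each given as a list of complete HTML blocks starting with its heading, and the name heading_then_size is hypothetical.

```python
def heading_then_size(sections, max_chars):
    """Pack each section's blocks by size without merging sections."""
    chunks = []
    for blocks in sections:
        current = ""
        for block in blocks:
            # Split inside a section only when it outgrows the limit.
            if current and len(current) + len(block) > max_chars:
                chunks.append(current)
                current = ""
            current += block
        # The end of a section always ends the chunk.
        if current:
            chunks.append(current)
    return chunks

long_x = "<p>" + "x" * 30 + "</p>"
long_y = "<p>" + "y" * 30 + "</p>"
result = heading_then_size(
    [["<h2>A</h2>", "<p>short</p>"], ["<h2>B</h2>", long_x, long_y]],
    50,
)
```

Here the short section stays in one piece, while the long section is divided into two chunks, the first of which still carries its heading.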

Requirements

  • Drupal 10.4, 11, or 12
  • AI module with the AI Automators sub-module enabled
  • PHP DOM and libxml extensions (enabled by default in most PHP distributions)

Installation

drush en doc_html_chunker

No additional configuration is required after installation. Set up the automator on any text_long field through the standard AI Automators field settings UI.

Related modules

  • AI — required base module and automator framework
  • AI Automator: Pandoc — converts Word/PDF files to HTML to feed into this chunker
  • AI Document Proofreader — full node-based AI proofreading workflow that uses this module for chunk preparation

Activity

  • Total releases: 1
  • First release: Mar 2026
  • Stability: 0% stable

Releases

Version      Type   Release date
1.0.x-dev    Dev    Mar 9, 2026