doc_html_chunker
An AI Automators plugin that splits a text_long HTML field into JSON-encoded chunks for downstream AI processing.
Large documents cannot be sent to an AI model in one request. This module breaks the HTML into manageable pieces using a configurable strategy, stores them as a JSON array in a target field, and lets the AI process each chunk independently before the results are reassembled.
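As a rough illustration of the storage format, the sketch below encodes a list of HTML chunks as one JSON array and drops fragments below a minimum size. The function name encode_chunks() and the choice to measure size on the tag-stripped text are assumptions made for this example, not the module's actual code.

```php
<?php
// Hypothetical helper, not the module's real implementation: discard short
// fragments, then encode the remaining chunks as a single JSON array.
function encode_chunks(array $chunks, int $minSize): string {
  $kept = array_filter(
    $chunks,
    // Assumption: size is measured on visible text, in characters.
    fn(string $chunk): bool => mb_strlen(trim(strip_tags($chunk))) >= $minSize
  );
  // array_values() reindexes so json_encode() emits a list, not an object.
  return json_encode(
    array_values($kept),
    JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR
  );
}

$json = encode_chunks(
  ['<h2>Intro</h2><p>Some longer text here.</p>', '<p>Hi</p>'],
  10
);
echo $json, PHP_EOL;
```

A consumer such as a queue worker could then call json_decode($json, true) and handle each array item on its own.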
Features
- Four chunking algorithms to match different document structures (see below)
- Configurable maximum and minimum chunk sizes (fragments shorter than the minimum are discarded automatically)
- Uses PHP DOMDocument for structurally aware splitting — output is always valid HTML
- Full UTF-8 safety: wraps input in a charset meta tag before parsing, preventing DOMDocument's default Latin-1 assumption from corrupting em-dashes, smart quotes, and other multibyte characters
- Output is a single JSON-encoded array stored in the target field, ready for queue-based AI processing
- Integrates with the AI Document Proofreader workflow: sets field_ai_doc_proofread_status to "chunking" after storing chunks, which automatically triggers the proofreading queue
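The UTF-8 handling mentioned in the list above can be sketched roughly as follows. This is a minimal example of the general technique (declaring the charset in a wrapper before calling DOMDocument::loadHTML()); the function names load_html_utf8() and body_inner_html() and the exact wrapper markup are assumptions, not taken from the module.

```php
<?php
// Hypothetical sketch: parse an HTML fragment as UTF-8 with DOMDocument.
function load_html_utf8(string $html): DOMDocument {
  $doc = new DOMDocument();
  // Collect libxml warnings (for example about HTML5 tags) instead of printing them.
  $previous = libxml_use_internal_errors(true);
  // The charset declaration tells the parser the bytes are UTF-8; without it
  // loadHTML() may fall back to a Latin-1 interpretation.
  $doc->loadHTML(
    '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>' . $html . '</body></html>'
  );
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  return $doc;
}

// Serialize only the children of <body>, leaving the wrapper out.
function body_inner_html(DOMDocument $doc): string {
  $body = $doc->getElementsByTagName('body')->item(0);
  $out = '';
  foreach ($body->childNodes as $child) {
    $out .= $doc->saveHTML($child);
  }
  return $out;
}

// Curly quotes and an em dash, written as escapes so the file encoding is irrelevant.
$doc = load_html_utf8("<p>Smart \u{201C}quotes\u{201D} \u{2014} and dashes</p>");
echo body_inner_html($doc), PHP_EOL;
```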
Chunking algorithms
By heading (recommended)
Splits at every <h1>–<h4> boundary. Each heading and the content that follows it becomes one chunk. Best for structured documents such as reports, manuals, and articles.
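A minimal sketch of heading-based splitting is shown below. It assumes that only headings sitting directly under the document body count as boundaries and that any content before the first heading becomes its own chunk; the function name chunk_by_heading() is hypothetical and the module's real logic may differ.

```php
<?php
// Hypothetical sketch: start a new chunk at every top-level <h1> to <h4>.
function chunk_by_heading(string $html): array {
  $doc = new DOMDocument();
  $previous = libxml_use_internal_errors(true);
  $doc->loadHTML(
    '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>' . $html . '</body></html>'
  );
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $body = $doc->getElementsByTagName('body')->item(0);
  $chunks = [];
  $current = '';
  foreach ($body->childNodes as $node) {
    $isHeading = $node instanceof DOMElement
      && in_array(strtolower($node->nodeName), ['h1', 'h2', 'h3', 'h4'], true);
    // A heading closes the chunk collected so far and opens a new one.
    if ($isHeading && trim($current) !== '') {
      $chunks[] = $current;
      $current = '';
    }
    $current .= $doc->saveHTML($node);
  }
  if (trim($current) !== '') {
    $chunks[] = $current;
  }
  return $chunks;
}

print_r(chunk_by_heading('<p>Preamble</p><h2>One</h2><p>a</p><h3>Two</h3><p>b</p>'));
```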
By paragraph
Splits at every top-level block element. Each paragraph, list, table, or other block becomes its own chunk. Produces many small chunks; suited to documents with very long sections.
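The paragraph strategy might look something like the sketch below, which treats every direct child of the body as one chunk and skips whitespace-only text between blocks. The name chunk_by_paragraph() and the whitespace handling are assumptions for illustration.

```php
<?php
// Hypothetical sketch: one chunk per top-level node of the document body.
function chunk_by_paragraph(string $html): array {
  $doc = new DOMDocument();
  $previous = libxml_use_internal_errors(true);
  $doc->loadHTML(
    '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>' . $html . '</body></html>'
  );
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $body = $doc->getElementsByTagName('body')->item(0);
  $chunks = [];
  foreach ($body->childNodes as $node) {
    // Serializing the node itself keeps its tags balanced.
    $piece = trim($doc->saveHTML($node));
    if ($piece !== '') {
      $chunks[] = $piece;
    }
  }
  return $chunks;
}

print_r(chunk_by_paragraph("<p>One</p>\n<ul><li>a</li><li>b</li></ul>\n<p>Two</p>"));
```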
By size
Splits mechanically at the configured character limit, always breaking at a tag boundary so the output remains valid HTML. Useful as a fallback for unstructured documents.
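One simple way to approximate size-based splitting is to pack top-level nodes greedily until the next node would push the chunk over the limit. The sketch below does that; it is an assumption about the approach rather than the module's code, the name chunk_by_size() is hypothetical, and a single element larger than the limit is kept whole here instead of being split further.

```php
<?php
// Hypothetical sketch: greedy packing of top-level nodes up to $maxSize characters.
function chunk_by_size(string $html, int $maxSize): array {
  $doc = new DOMDocument();
  $previous = libxml_use_internal_errors(true);
  $doc->loadHTML(
    '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>' . $html . '</body></html>'
  );
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $body = $doc->getElementsByTagName('body')->item(0);
  $chunks = [];
  $current = '';
  foreach ($body->childNodes as $node) {
    $piece = $doc->saveHTML($node);
    // Break between nodes only, so every chunk contains complete elements.
    if ($current !== '' && mb_strlen($current . $piece) > $maxSize) {
      $chunks[] = $current;
      $current = '';
    }
    $current .= $piece;
  }
  if (trim($current) !== '') {
    $chunks[] = $current;
  }
  return $chunks;
}

print_r(chunk_by_size('<p>aaaaaaaaaa</p><p>bbbbbbbbbb</p><p>cccccccccc</p>', 40));
```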
By heading, then size
Splits at headings first, then further splits any section that exceeds the maximum chunk size using the by-size algorithm. Combines structural awareness with a hard size cap — good for documents with a mix of short and very long sections.
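The combined strategy can be sketched in a single pass: a new chunk starts at each top-level heading, and also whenever adding the next node would exceed the size limit. Under the assumptions used here (greedy packing at top-level node boundaries, hypothetical name chunk_by_heading_then_size()), this gives the same result as splitting by heading first and then re-splitting oversized sections.

```php
<?php
// Hypothetical sketch: split at <h1> to <h4>, with a size cap inside each section.
function chunk_by_heading_then_size(string $html, int $maxSize): array {
  $doc = new DOMDocument();
  $previous = libxml_use_internal_errors(true);
  $doc->loadHTML(
    '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>' . $html . '</body></html>'
  );
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $body = $doc->getElementsByTagName('body')->item(0);
  $chunks = [];
  $current = '';
  foreach ($body->childNodes as $node) {
    $piece = $doc->saveHTML($node);
    $isHeading = $node instanceof DOMElement
      && in_array(strtolower($node->nodeName), ['h1', 'h2', 'h3', 'h4'], true);
    $tooBig = mb_strlen($current . $piece) > $maxSize;
    // Close the current chunk at a heading or when the size cap would be exceeded.
    if (trim($current) !== '' && ($isHeading || $tooBig)) {
      $chunks[] = $current;
      $current = '';
    }
    $current .= $piece;
  }
  if (trim($current) !== '') {
    $chunks[] = $current;
  }
  return $chunks;
}

print_r(chunk_by_heading_then_size(
  '<h2>A</h2><p>aaaaaaaaaa</p><p>bbbbbbbbbb</p><h2>B</h2><p>c</p>',
  32
));
```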
Requirements
- Drupal 10.4, 11, or 12
- AI module with the AI Automators sub-module enabled
- PHP DOM and libxml extensions (enabled by default in most PHP distributions)
Installation
drush en doc_html_chunker
No additional configuration is required after installation. Set up the automator on any text_long field through the standard AI Automators field settings UI.
Related modules
- AI — required base module and automator framework
- AI Automator: Pandoc — converts Word/PDF files to HTML to feed into this chunker
- AI Document Proofreader — full node-based AI proofreading workflow that uses this module for chunk preparation