html_segmenter
Html Segmenter is a utility module for Drupal that helps extract translatable text segments from HTML-rich content, such as long-form body fields or rich text areas. Originally part of the WEB-T module, it has been improved and generalized to support a wide range of use cases and to work cleanly as a standalone dependency.
This module enables translation providers, AI services, or other modules to process only the meaningful text (including translatable attribute values such as alt and title) while leaving HTML tags, links, and dynamic tokens untouched, so the structure of the original document is preserved.
Html Segmenter is designed to be reusable and easily integrated with other modules that need HTML field segmentation (e.g., 'ai_translate', 'WEB-T', custom migration pipelines, or content moderation workflows).
Features
- extractTranslatableHtmlValues(): isolates the raw translatable text parts from HTML
- mergeTranslatedHtmlValues(): reassembles translated strings into the original HTML structure
- Preserves links, inline tokens, and non-translatable placeholders
- Can be used standalone or as a service by other modules
- Supports both segment extraction and optional reintegration after translation
- Ideal for AI-based or external machine translation systems
Usage
Inject or fetch the service:
// Get the service.
$segmenter = \Drupal::service('html_segmenter.segmenter');
// Extract translatable segments from HTML.
$html = 'Hello <img alt="greeting" title="wave"> <b>world</b>!';
$segments = $segmenter->extractTranslatableHtmlValues([$html]);
// $segments might be: ['Hello ', 'greeting', 'wave', 'world', '!']
// After translating the segments (e.g., with an AI service):
$translations = ['Ciao ', 'saluto', 'onda', 'mondo', '!'];
// Merge the translations back into the original HTML.
$merged = $segmenter->mergeTranslatedHtmlValues([$html], $segments, $translations);
// $merged will be: ['Ciao <img alt="saluto" title="onda"> <b>mondo</b>!']
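The extract, translate, merge round trip above can be wrapped in a small helper so calling code deals with a single method. The following is a minimal sketch, not part of the module: the class name HtmlTranslationHelper and the $translate callable are hypothetical, and the only assumptions about the segmenter are the two method calls and return shapes shown in the example above.

```php
<?php

/**
 * Hypothetical helper (not shipped with Html Segmenter).
 *
 * Wraps the extract -> translate -> merge round trip for one HTML string.
 */
final class HtmlTranslationHelper {

  /**
   * Callable that takes one source string and returns its translation.
   *
   * @var callable
   */
  private $translate;

  /**
   * @param object $segmenter
   *   The service fetched via \Drupal::service('html_segmenter.segmenter').
   *   Typed as a plain object because its class name is not documented here.
   * @param callable $translate
   *   Assumed signature: function (string $text): string.
   */
  public function __construct(private object $segmenter, callable $translate) {
    $this->translate = $translate;
  }

  /**
   * Translates the text inside $html, leaving the markup untouched.
   */
  public function translateHtml(string $html): string {
    $segments = $this->segmenter->extractTranslatableHtmlValues([$html]);
    if (empty($segments)) {
      // Nothing translatable was found, so return the input unchanged.
      return $html;
    }

    // Translate each segment separately; array_map() keeps the keys,
    // so segments and translations stay aligned.
    $translations = array_map($this->translate, $segments);

    $merged = $this->segmenter->mergeTranslatedHtmlValues([$html], $segments, $translations);

    // Fall back to the original HTML if the merge returned nothing.
    return $merged[0] ?? $html;
  }

}
```

In a Drupal context the helper could be built as new HtmlTranslationHelper(\Drupal::service('html_segmenter.segmenter'), $translate), where $translate is whatever callable wraps your translation provider. Passing the segmenter in through the constructor, rather than calling \Drupal::service() inside the method, keeps the helper testable with a stub segmenter.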
Additional Requirements
No external libraries required
No contrib dependencies
Recommended modules
AI Translate -- now part of the Drupal AI module, for AI-based translation pipelines
AI Provider WEB-T -- a Drupal AI provider that uses WEB-T e-Translation Services
Similar projects
simplehtmldom API -- provides low-level DOM parsing, but is not focused on segmentation for translation
Html Segmenter is focused and reusable, making it ideal as a shared tool for multilingual and AI-driven projects.
Supporting this Module
This module was created as part of ongoing multilingual and AI development efforts. Contributions and feedback welcome!