html_segmenter

2 sites No security coverage

Html Segmenter is a utility module for Drupal that helps extract translatable text segments from HTML-rich content, such as long-form body fields or rich text areas. Originally part of the WEB-T module, it has been improved and generalized to support a wide range of use cases and to work cleanly as a standalone dependency.

This module lets translation providers, AI services, or other modules process only the meaningful plain text content, ignoring HTML tags, links, and dynamic tokens, without breaking the structure of the original document.

Html Segmenter is designed to be reusable and easily integrated with other modules that need HTML field segmentation (e.g., 'ai_translate', 'WEB-T', custom migration pipelines, or content moderation workflows).

Features

- extractTranslatableHtmlValues(): isolates raw translatable text parts from HTML.
- mergeTranslatedHtmlValues(): reassembles translated strings back into the original HTML structure.
- Preserves links, inline tokens, and non-translatable placeholders
- Can be used standalone or as a service by other modules
- Returns both segmented input and optional reintegration after translation
- Ideal for AI-based or external machine translation systems
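
As a rough illustration of the extract/merge idea listed above, the following standalone sketch uses PHP's DOM extension. It is not the module's implementation: the function names (sketch_load, sketch_collect, sketch_extract, sketch_merge), the choice of alt and title as the only translatable attributes, and the sample markup are all assumptions made for this example.

```php
<?php

declare(strict_types=1);

// Illustrative sketch only; html_segmenter's real internals may differ.

/** Parses an HTML fragment as UTF-8 and returns the wrapping <body> element. */
function sketch_load(string $html): DOMElement {
  $doc = new DOMDocument();
  $previous = libxml_use_internal_errors(true);
  $doc->loadHTML(
    '<!DOCTYPE html><html><head><meta http-equiv="Content-Type" '
    . 'content="text/html; charset=utf-8"></head><body>' . $html . '</body></html>'
  );
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  return $doc->getElementsByTagName('body')->item(0);
}

/** Collects, in document order, the nodes whose values look translatable. */
function sketch_collect(DOMNode $node, array &$found): void {
  if ($node instanceof DOMText) {
    // Whitespace-only text nodes are layout, not content.
    if (trim($node->data) !== '') {
      $found[] = $node;
    }
    return;
  }
  if (!$node instanceof DOMElement) {
    return; // Comments and other node types are skipped.
  }
  // Assumption: only alt and title attributes carry translatable text.
  foreach (['alt', 'title'] as $name) {
    $attr = $node->getAttributeNode($name);
    if ($attr instanceof DOMAttr && trim($attr->value) !== '') {
      $found[] = $attr;
    }
  }
  foreach ($node->childNodes as $child) {
    sketch_collect($child, $found);
  }
}

/** Returns the translatable strings found in the fragment. */
function sketch_extract(string $html): array {
  $found = [];
  sketch_collect(sketch_load($html), $found);
  return array_map(static fn ($n): string => (string) $n->nodeValue, $found);
}

/** Re-parses the fragment and writes the translations into the same slots. */
function sketch_merge(string $html, array $translations): string {
  $body = sketch_load($html);
  $found = [];
  sketch_collect($body, $found);
  $translations = array_values($translations);
  if (count($found) !== count($translations)) {
    throw new InvalidArgumentException('Segment and translation counts differ.');
  }
  foreach ($found as $i => $node) {
    $node->nodeValue = $translations[$i];
  }
  $out = '';
  foreach ($body->childNodes as $child) {
    $out .= $body->ownerDocument->saveHTML($child);
  }
  return $out;
}

$sample = '<p>Hello <a href="/about" title="About us">world</a>!</p>';
$segments = sketch_extract($sample);
// $segments is ['Hello ', 'About us', 'world', '!']

$merged = sketch_merge($sample, ['Ciao ', 'Chi siamo', 'mondo', '!']);
// The href is untouched; only the text nodes and the title attribute change.
echo $merged, PHP_EOL;
```

Because both functions walk the document in the same order, the position of a segment in the extracted list is enough to find its slot again when merging.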

Usage

Inject or fetch the service:

// Get the service.
$segmenter = \Drupal::service('html_segmenter.segmenter');

// Extract translatable segments from HTML.
$html = 'Hello <img alt="greeting" title="wave"> <b>world</b>!';
$segments = $segmenter->extractTranslatableHtmlValues([$html]);
// $segments might be: ['Hello ', 'greeting', 'wave', 'world', '!']

// After translating the segments (e.g., with an AI service):
$translations = ['Ciao ', 'saluto', 'onda', 'mondo', '!'];

// Merge the translations back into the original HTML.
$merged = $segmenter->mergeTranslatedHtmlValues([$html], $segments, $translations);
// $merged will be: ['Ciao <img alt="saluto" title="onda"> <b>mondo</b>!']
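
Segments such as 'Hello ' carry spacing that the merged markup relies on, and a translation back end may trim or normalize it. A minimal sketch of the translation step that sits between extraction and merging, assuming a hypothetical helper sketch_translate_segments() and a dictionary closure standing in for a real translation service (neither is part of html_segmenter):

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of html_segmenter: translates each segment
 * while leaving its leading and trailing whitespace untouched.
 *
 * @param string[] $segments  Strings returned by the extraction step.
 * @param callable $translate fn(string): string, a stand-in for a service call.
 *
 * @return string[] Translations in the same order and count as $segments.
 */
function sketch_translate_segments(array $segments, callable $translate): array {
  $out = [];
  foreach ($segments as $segment) {
    $core = trim($segment);
    if ($core === '') {
      // Nothing to translate; pass whitespace-only segments through as-is.
      $out[] = $segment;
      continue;
    }
    // Recover the whitespace that trim() removed on each side.
    $lead = substr($segment, 0, strlen($segment) - strlen(ltrim($segment)));
    $trail = substr($segment, strlen(rtrim($segment)));
    $out[] = $lead . $translate($core) . $trail;
  }
  return $out;
}

// A toy dictionary stands in for an AI or machine translation service.
$dictionary = [
  'Hello' => 'Ciao',
  'greeting' => 'saluto',
  'wave' => 'onda',
  'world' => 'mondo',
];
$translate = static fn (string $text): string => $dictionary[$text] ?? $text;

$segments = ['Hello ', 'greeting', 'wave', 'world', '!'];
$translations = sketch_translate_segments($segments, $translate);
// $translations is ['Ciao ', 'saluto', 'onda', 'mondo', '!']
```

The result has the same length and order as the input, which is what the merge call in the usage example above appears to rely on.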

Additional Requirements

- No external libraries required
- No contrib dependencies

Recommended modules

- AI Translate: now part of the Drupal AI module, for AI-based translation pipelines
- AI Provider WEB-T: a Drupal AI provider for using the WEB-T e-Translation Services

Similar projects

- simplehtmldom API provides low-level DOM parsing, but is not focused on segmentation for translation.
Html Segmenter is focused and reusable, making it ideal as a shared tool for multilingual and AI-driven projects.

Supporting this Module

This module was created as part of ongoing multilingual and AI development efforts. Contributions and feedback are welcome!

Activity

Total releases: 1
First release: Jul 2025
Latest release: 9 months ago
Stability: 100% stable

Releases

Version  Type    Release date
1.0.0    Stable  Jul 1, 2025