
AI Search - Semantic Chunking (ai_search_sc)

This project is not covered by Drupal's security advisory policy.

The AI Search - Semantic Chunking module adds a Semantic Embedding Strategy to AI Search as a drop-in alternative to the built-in token-based chunkers. Chunks are split at embedding-similarity breakpoints instead of fixed token windows, which tends to keep topically related sentences together.

When to use it. Prefer this strategy when:

  • Content is long-form prose (docs, articles, guides) where topic shifts are a better chunk boundary than a fixed token count.
  • You accept a higher indexing-time embedding cost in exchange for more coherent retrieval chunks.

Stay on the token-based strategy when:

  • Documents are short, structured, or highly uniform (product catalogs, short FAQ entries).
  • Embedding-call cost is a constraint and documents are very large.

How it works. Per document:

  1. Markdown clean-up (MarkdownAwareSentenceSplitter, enabled by default as a service decorator around SentenceSplitterInterface): strips setext underlines, ATX # markers, horizontal rules, and emphasis markers (**, __, *, _); rewrites numbered list markers from 1. to 1) so they do not fragment sentences; then re-merges short un-terminated fragments (table cells, section headings) with the following sentence so they stay as one semantic unit.
  2. Split main content into sentences (paragraph breaks = hard boundaries; abbreviation-aware).
  3. Embed each unique sentence once.
  4. Compute cosine distance between consecutive sentences.
  5. Take the configured percentile of those distances as the split threshold.
  6. Start a new chunk whenever the inter-sentence distance exceeds the threshold and the current chunk has at least min_sentences_per_chunk sentences, or whenever appending the next sentence would exceed max_chunk_chars (hard cap, measured in Unicode characters).
  7. Prepend title + contextual content to each chunk and re-embed the assembled chunk (same shape as the Enriched strategy).
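The splitting logic in steps 4–6 can be sketched in a few lines. This is an illustrative Python sketch, not the module's PHP implementation: the helper names (`cosine_distance`, `percentile`, `semantic_chunks`), the nearest-rank percentile, and the hand-made two-dimensional vectors standing in for provider embeddings are all assumptions.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def percentile(values, q):
    """Nearest-rank percentile; the module's exact interpolation is not documented."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

def semantic_chunks(sentences, vectors, q=0.95, min_sentences=2, max_chars=4000):
    """Split sentences into chunks at embedding-similarity breakpoints (steps 4-6)."""
    dists = [cosine_distance(vectors[i], vectors[i + 1])
             for i in range(len(vectors) - 1)]
    threshold = percentile(dists, q) if dists else 0.0
    chunks, current = [], []
    for i, sentence in enumerate(sentences):
        # Hard cap: never let the assembled chunk exceed max_chars.
        too_big = current and len(" ".join(current + [sentence])) > max_chars
        # Semantic breakpoint: distance to the previous sentence exceeds the
        # percentile threshold and the chunk already has enough sentences.
        breakpoint_ = (
            current and dists[i - 1] > threshold and len(current) >= min_sentences
        )
        if too_big or breakpoint_:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With toy vectors where the third sentence changes topic, a lower percentile such as 0.5 splits four sentences into two topical chunks, while the character cap forces a split regardless of the sentence count.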

If the embedder fails or returns an unusable shape, or if the document has more than max_sentences_for_semantic sentences, the strategy falls back to character-based chunking and skips the per-sentence embedding pass.

Cost model. For a document with N distinct sentences producing M chunks, expect N + M embedding calls per document at indexing time. Duplicate sentences (headings, boilerplate) are deduped inside the embedder. The max_sentences_for_semantic cap is the primary cost-containment knob.
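As a concrete instance of the formula (a trivial sketch; embedding_calls is an illustrative helper, not a module API):

```python
def embedding_calls(distinct_sentences: int, chunks: int) -> int:
    """Indexing-time embedding calls: one per distinct sentence, one per chunk."""
    return distinct_sentences + chunks

# A 40-sentence article that yields 6 chunks costs 40 + 6 = 46 calls,
# versus 6 for a strategy that only embeds the assembled chunks.
```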

Requirements

This submodule requires the following:

  • RAG Search — the parent project this submodule ships inside. The Semantic Embedding Strategy works against any AI Search index, but if you want end-to-end retrieval-augmented question answering on top of it, enable the parent module as well.
  • An AI embedding provider configured for AI Search — the strategy delegates sentence and chunk embedding to whichever provider AI Search is wired to use.

Installation

Enable the submodule the same way as any other Drupal module:

drush en ai_search_sc

For further information, see Installing Drupal Modules.

Configuration

Once enabled, the strategy appears as Semantic Embedding Strategy in the AI Search index configuration. Select it on the relevant Search API index and configure the settings below.

  • breakpoint_percentile (default 0.95): Higher = fewer, larger chunks. Range 0.5–0.99.
  • min_sentences_per_chunk (default 2): Upper-bounded at 20 in the UI; larger values effectively disable semantic splitting.
  • max_chunk_chars (default 4000): Hard cap on assembled chunk size, in Unicode characters.
  • max_sentences_for_semantic (default 500): Above this, skip embeddings and use character-based fallback.
  • chunk_size, chunk_min_overlap (no default): Only used by the fallback path. Blank values use the model default and 100 tokens respectively.
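The effect of breakpoint_percentile is easiest to see on a toy list of inter-sentence distances. This is an illustrative sketch using a nearest-rank percentile; the module's exact interpolation may differ.

```python
def split_count(distances, q):
    """Count boundaries whose distance exceeds the q-th percentile (nearest rank)."""
    ordered = sorted(distances)
    threshold = ordered[min(len(ordered) - 1, int(q * len(ordered)))]
    return sum(d > threshold for d in distances)

dists = [0.05, 0.10, 0.40, 0.12, 0.80, 0.07]
# q=0.95 -> threshold 0.80 -> 0 splits (one chunk per document)
# q=0.75 -> threshold 0.40 -> 1 split
# q=0.50 -> threshold 0.12 -> 2 splits
```

This is why lowering the percentile is the first lever when every document collapses into a single chunk.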

When max_chunk_chars is a soft cap

Two cases intentionally exceed the configured max_chunk_chars:

  1. A single sentence longer than the cap. The chunker never splits mid-sentence — the sentence is the semantic unit the embedder operates on — so it is emitted intact and the assembled chunk will exceed the cap.
  2. A title longer than the main-content budget. Titles carry identity and truncating them corrupts retrieval metadata, so oversized titles are kept intact and push the final chunk past the cap.

If either case matters for your corpus, raise max_chunk_chars, shorten source titles, or preprocess source content to split overly long sentences.
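A minimal sketch of why case 1 happens (illustrative only; assemble is a hypothetical helper that ignores titles and contextual content):

```python
def assemble(sentences, max_chars):
    """Pack whole sentences into chunks; a sentence is never split mid-way,
    so one oversized sentence yields a chunk that exceeds max_chars."""
    chunks, current = [], ""
    for s in sentences:
        candidate = (current + " " + s).strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

oversized = "x" * 50
# With max_chars=20, the 50-character sentence is emitted intact
# as its own chunk, past the configured cap.
```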

Disabling the Markdown-aware decorator

The decorator is registered in ai_search_sc.services.yml and wraps ai_search_sc.splitter transparently. To bypass it (e.g., if source content is already plain text and the clean-up is counterproductive), add a ServiceProvider in a custom module:

namespace Drupal\my_module;

use Drupal\Core\DependencyInjection\ContainerBuilder;
use Drupal\Core\DependencyInjection\ServiceModifierInterface;

final class MyModuleServiceProvider implements ServiceModifierInterface {

  /**
   * Removes the Markdown-aware splitter decorator from the container.
   *
   * alter() runs after all module services.yml files are loaded, which is
   * the documented hook for modifying another module's service definitions.
   */
  public function alter(ContainerBuilder $container): void {
    if ($container->hasDefinition('ai_search_sc.markdown_aware_splitter')) {
      $container->removeDefinition('ai_search_sc.markdown_aware_splitter');
    }
  }

}

Clear caches (drush cr) after adding the provider.

Troubleshooting

Every document produces a single chunk: Lower breakpoint_percentile (e.g. 0.75), or drop min_sentences_per_chunk to 1, or reduce max_chunk_chars.

Indexing is slow / expensive: Lower max_sentences_for_semantic to force the fallback sooner on long documents, or switch back to the token strategy for the affected index.

Snake_case / file_name identifiers in content get mangled: The Markdown-aware decorator strips _text_ as italic emphasis. If your content contains snake_case identifiers you want preserved, disable the decorator in a custom ServiceProvider, or preprocess content to escape underscores before indexing.
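For illustration, here is a naive emphasis-stripping pattern with the same failure mode. This is not the decorator's actual regex, only a demonstration of why underscore-delimited identifiers are at risk.

```python
import re

def strip_emphasis(text: str) -> str:
    """Treat _text_ as italics and drop the underscores (naive illustration)."""
    return re.sub(r"_([^_]+)_", r"\1", text)

# Intended behaviour:   "_important_"     -> "important"
# Collateral damage:    "max_chunk_chars" -> "maxchunkchars"
```

A pattern this naive is not reliably defeated by backslash-escaping either, which is why disabling the decorator or preprocessing content is the dependable fix.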

FAQ

Q: Does this replace the token-based chunker for every index?

A: No. The strategy is selected per Search API index, so you can run the semantic strategy on long-form indexes and keep the token strategy on short/structured ones.

Q: Can I keep the semantic chunker but skip the Markdown clean-up?

A: Yes. The clean-up is a service decorator and can be removed without touching the chunker. See Disabling the Markdown-aware decorator under Configuration.

Q: What happens when the embedding provider is down?

A: The strategy falls back to character-based chunking for that document (same path used when a document exceeds max_sentences_for_semantic), so indexing continues without the per-sentence embedding pass.

Maintainers

Supporting this Module

Buy me a hot chocolate :)


Releases

  • 1.0.0 (stable): Apr 17, 2026
  • 1.0.x-dev (dev): Apr 17, 2026