ai_search_sc
The AI Search - Semantic Chunking module adds a Semantic Embedding Strategy to AI Search as a drop-in alternative to the built-in token-based chunkers. Chunks are split at embedding-similarity breakpoints instead of fixed token windows, which tends to keep topically related sentences together.
When to use it. Prefer this strategy when:
- Content is long-form prose (docs, articles, guides) where topic shifts are a better chunk boundary than a fixed token count.
- You accept a higher indexing-time embedding cost in exchange for more coherent retrieval chunks.
Stay on the token-based strategy when:
- Documents are short, structured, or highly uniform (product catalogs, short FAQ entries).
- Embedding-call cost is a constraint and documents are very large.
How it works. Per document:
- Markdown clean-up (`MarkdownAwareSentenceSplitter`, enabled by default as a service decorator around `SentenceSplitterInterface`): strips setext underlines, ATX `#` markers, horizontal rules, and emphasis markers (`**`, `__`, `*`, `_`); rewrites numbered list markers from `1.` to `1)` so they do not fragment sentences; then re-merges short unterminated fragments (table cells, section headings) with the following sentence so they stay as one semantic unit.
- Split main content into sentences (paragraph breaks = hard boundaries; abbreviation-aware).
- Embed each unique sentence once.
- Compute cosine distance between consecutive sentences.
- Take the configured percentile of those distances as the split threshold.
- Start a new chunk whenever the inter-sentence distance exceeds the threshold and the current chunk has at least `min_sentences_per_chunk` sentences, or whenever appending the next sentence would exceed `max_chunk_chars` (hard cap, measured in Unicode characters).
- Prepend title + contextual content to each chunk and re-embed the assembled chunk (same shape as the Enriched strategy).
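The steps above can be sketched in a few lines. This is a minimal, language-agnostic illustration of percentile-breakpoint chunking, not the module's actual implementation; all names and the nearest-rank percentile choice are assumptions:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def percentile(values, p):
    # Nearest-rank percentile over the sorted distances.
    s = sorted(values)
    idx = min(len(s) - 1, max(0, math.ceil(p * len(s)) - 1))
    return s[idx]

def semantic_chunks(sentences, embeddings, breakpoint_percentile=0.95,
                    min_sentences_per_chunk=2, max_chunk_chars=4000):
    if len(sentences) < 2:
        return [sentences]
    # Distance between each pair of consecutive sentence embeddings.
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    threshold = percentile(distances, breakpoint_percentile)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        over_cap = sum(len(s) for s in current) + len(sentences[i]) > max_chunk_chars
        at_breakpoint = (distances[i - 1] > threshold
                         and len(current) >= min_sentences_per_chunk)
        if over_cap or at_breakpoint:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks
```

Note that a single sentence longer than `max_chunk_chars` is still emitted whole, mirroring the soft-cap behavior described under Configuration.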
If the embedder fails or returns an unusable shape, or if the document has more than `max_sentences_for_semantic` sentences, the strategy falls back to character-based chunking and skips the per-sentence embedding pass.
Cost model. For a document with N distinct sentences producing M chunks, expect N + M embedding calls per document at indexing time. Duplicate sentences (headings, boilerplate) are deduped inside the embedder. The max_sentences_for_semantic cap is the primary cost-containment knob.
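As a back-of-the-envelope check (illustrative numbers; the N + M formula comes from the cost model above):

```python
def indexing_embedding_calls(distinct_sentences: int, chunks: int) -> int:
    # N per-sentence embeddings + M assembled-chunk re-embeddings.
    return distinct_sentences + chunks

# e.g. a 120-sentence article that chunks into 8 pieces:
print(indexing_embedding_calls(120, 8))  # 128 embedding calls
```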
Requirements
This submodule requires AI Search. In addition, an AI embedding provider must be configured for AI Search: the strategy delegates sentence and chunk embedding to whichever provider AI Search is wired to use.
Recommended modules
- RAG Search — the parent project this submodule ships inside. The Semantic Embedding Strategy works against any AI Search index, but if you want end-to-end retrieval-augmented question answering on top of it, enable the parent module as well.
Installation
Enable the submodule the same way as any other Drupal module:
```shell
drush en ai_search_sc
```

For further information, see Installing Drupal Modules.
Configuration
Once enabled, the strategy appears as Semantic Embedding Strategy in the AI Search index configuration. Select it on the relevant index under the Search API settings and configure the fields below.
| Setting | Default | Notes |
| --- | --- | --- |
| `breakpoint_percentile` | 0.95 | Higher = fewer, larger chunks. Range 0.5–0.99. |
| `min_sentences_per_chunk` | 2 | Upper-bounded at 20 in the UI; larger values effectively disable semantic splitting. |
| `max_chunk_chars` | 4000 | Hard cap on assembled chunk size, in Unicode characters. |
| `max_sentences_for_semantic` | 500 | Above this, skip embeddings and use character-based fallback. |
| `chunk_size`, `chunk_min_overlap` | — | Only used by the fallback path. Blank values use the model default / 100 tokens respectively. |
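For orientation, the settings map onto a configuration fragment along these lines. The key nesting shown here is an illustrative sketch, not the module's exported config schema; only the setting names and defaults come from the table above:

```yaml
# Illustrative only — nesting is an assumption; values are the defaults.
chunking:
  breakpoint_percentile: 0.95
  min_sentences_per_chunk: 2
  max_chunk_chars: 4000
  max_sentences_for_semantic: 500
  # Fallback-only settings; blank = model default / 100 tokens.
  chunk_size: null
  chunk_min_overlap: null
```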
When max_chunk_chars is a soft cap
Two cases intentionally exceed the configured max_chunk_chars:
- A single sentence longer than the cap. The chunker never splits mid-sentence — the sentence is the semantic unit the embedder operates on — so it is emitted intact and the assembled chunk will exceed the cap.
- A title longer than the main-content budget. Titles carry identity and truncating them corrupts retrieval metadata, so oversized titles are kept intact and push the final chunk past the cap.
If either case matters for your corpus, raise max_chunk_chars, shorten source titles, or preprocess source content to split overly long sentences.
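If you take the preprocessing route, one minimal approach is to split any sentence longer than the cap at word boundaries before it reaches the indexer. This is a sketch under stated assumptions (the function name is hypothetical; the default mirrors `max_chunk_chars`):

```python
def split_long_sentence(sentence: str, max_chars: int = 4000) -> list[str]:
    """Split one overly long sentence at word boundaries so no piece
    exceeds max_chars. A single word longer than max_chars is kept intact."""
    pieces, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces
```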
Disabling the Markdown-aware decorator
The decorator is registered in ai_search_sc.services.yml and wraps ai_search_sc.splitter transparently. To bypass it (e.g., if source content is already plain text and the clean-up is counterproductive), add a ServiceProvider in a custom module:
```php
namespace Drupal\my_module;

use Drupal\Core\DependencyInjection\ContainerBuilder;
use Drupal\Core\DependencyInjection\ServiceProviderInterface;

final class MyModuleServiceProvider implements ServiceProviderInterface {

  public function register(ContainerBuilder $container): void {
    if ($container->hasDefinition('ai_search_sc.markdown_aware_splitter')) {
      $container->removeDefinition('ai_search_sc.markdown_aware_splitter');
    }
  }

}
```
Clear caches (drush cr) after adding the provider.
Troubleshooting
Every document produces a single chunk: Lower breakpoint_percentile (e.g. 0.75), or drop min_sentences_per_chunk to 1, or reduce max_chunk_chars.
Indexing is slow / expensive: Lower max_sentences_for_semantic to force the fallback sooner on long documents, or switch back to the token strategy for the affected index.
Snake_case / file_name identifiers in content get mangled: The Markdown-aware decorator strips _text_ as italic emphasis. If your content contains snake_case identifiers you want preserved, disable the decorator in a custom ServiceProvider, or preprocess content to escape underscores before indexing.
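For the preprocessing option, one language-agnostic sketch is to escape only underscores flanked by word characters, so `snake_case` identifiers survive while `_emphasis_` markers (which border whitespace) are untouched. The regex and function name here are illustrative, not part of the module:

```python
import re

def escape_identifier_underscores(text: str) -> str:
    """Escape underscores inside words (snake_case, file_names) so
    Markdown clean-up does not treat them as emphasis markers.
    Underscores at word edges (_emphasis_) are left alone."""
    return re.sub(r"(?<=\w)_(?=\w)", r"\\_", text)
```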
FAQ
Q: Does this replace the token-based chunker for every index?
A: No. The strategy is selected per Search API index, so you can run the semantic strategy on long-form indexes and keep the token strategy on short/structured ones.
Q: Can I keep the semantic chunker but skip the Markdown clean-up?
A: Yes. The clean-up is a service decorator and can be removed without touching the chunker. See Disabling the Markdown-aware decorator under Configuration.
Q: What happens when the embedding provider is down?
A: The strategy falls back to character-based chunking for that document (same path used when a document exceeds max_sentences_for_semantic), so indexing continues without the per-sentence embedding pass.
Maintainers
- Dany Almeida Kairouz - dany.almeida.kairouz