
Search API Japanese Tokenizer is a Drupal module that segments and indexes Japanese text at the word level, improving search accuracy through natural language processing without requiring external search engines such as Apache Solr or Elasticsearch.

Japanese differs from English and many other languages in that it has no spaces between words. In English, "This is a pen." clearly separates words with spaces, making tokenization straightforward. In Japanese, however, "これはペンです。" contains no spaces, so word boundaries are hard to determine. A specialized natural language processing technique is therefore required to segment the text properly.
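The difference can be seen with a toy example. A bigram (N=2) index reduces これはペンです to overlapping two-character fragments, while word-level segmentation yields actual words. A minimal Python sketch (the word-level output is hand-segmented for illustration, not produced by this module):

```python
def bigrams(text: str) -> list[str]:
    """Split text into overlapping two-character fragments (N-gram, N=2)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(bigrams("これはペンです"))
# → ['これ', 'れは', 'はペ', 'ペン', 'ンで', 'です']

# Word-level segmentation of the same sentence, done by hand:
# ['これ', 'は', 'ペン', 'です']  (this | topic marker | pen | copula)
```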

By default, Drupal's standard search and the Search API module use N-gram segmentation, which has the following issues:

  • It is practically impossible to index and search for single-character words.
  • Irrelevant content can appear in search results (false positives).
  • Only exact phrase matching is supported.
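The false-positive problem is easy to reproduce. With bigram indexing, a document containing only 東京都 (Tokyo Metropolis) is indexed under the fragments 東京 and 京都, so a query for 京都 (Kyoto) matches it even though Kyoto never appears as a word. A minimal sketch:

```python
def bigram_index(text: str) -> set[str]:
    """Index a string as the set of its overlapping two-character fragments."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

doc = bigram_index("東京都")  # {'東京', '京都'}
query = bigram_index("京都")  # {'京都'}

# Every bigram of the query is present in the document's index, so the
# document matches: a false positive for word-level search intent.
print(query <= doc)  # → True
```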

The Search API Japanese Tokenizer module addresses these issues, enhancing Drupal’s search capabilities and improving the accuracy of Japanese search queries without relying on external search engines.

Features

  • Enables indexing and searching at the single-character level
  • Improves search accuracy by indexing at the word level
  • Uses machine learning-based tokenization
  • Supports index exclusion based on character type (TinySegmenter only)
  • Resolves variations in spelling when using morphological analysis
  • Allows exclusion of index entries based on part of speech when using morphological analysis
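Character-type filtering can be pictured with plain Unicode metadata. The sketch below classifies characters by script and drops all-hiragana tokens (often particles); it illustrates the idea only and is not the module's implementation:

```python
import unicodedata

def char_type(ch: str) -> str:
    """Classify a character by script, using its Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"

# Exclude tokens that consist entirely of hiragana (e.g. particles).
tokens = ["これ", "は", "ペン", "です"]
kept = [t for t in tokens if not all(char_type(c) == "hiragana" for c in t)]
print(kept)  # → ['ペン']
```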

Post-Installation

  1. Enable the Search API module

    This module requires the Search API module to be enabled.
  2. Install the module

    Enable the search_api_japanese_tokenizer module.
  3. Configure the search server and index

    Go to /admin/config/search/search-api and configure the search server and search index.
  4. Select the tokenizer

    In the Search API processor settings, choose one of the following tokenizers:
    • TinySegmenter tokenizer
    • MeCab tokenizer
    • Sudachi tokenizer
  5. Disable the default Tokenizer processor

    The default Tokenizer processor should be disabled, as enabling it may cause incorrect indexing.
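Steps 1–2 above can be sketched with Drush, assuming Drush is available on the site (module machine names are as listed on this page):

```shell
# Enable the Search API module and the tokenizer module (steps 1-2)
drush pm:enable search_api search_api_japanese_tokenizer -y

# Clear caches so the new tokenizer processors appear in the Search API
# processor settings; steps 3-5 are done in the UI at
# /admin/config/search/search-api
drush cache:rebuild
```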

Additional Requirements

This module requires the following components:

  • Search API module
  • Optional morphological analysis engines:
    • TinySegmenter (no additional installation required)
    • MeCab (must be installed on the server)
    • Sudachi (must be installed on the server; since it is written in Java, a JRE is required)
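Before selecting the MeCab or Sudachi tokenizers, it is worth confirming the external binaries are actually on the server's PATH. A small check (the binary names `mecab`, `sudachi`, and `java` are common defaults and may differ per installation):

```python
import shutil

def engine_availability() -> dict[str, bool]:
    """Report whether each external dependency is found on PATH.

    Names are common defaults; actual binary names and locations
    depend on how MeCab and Sudachi were installed.
    """
    return {name: shutil.which(name) is not None
            for name in ("mecab", "sudachi", "java")}

print(engine_availability())
```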

The following companion module is also available:

  • Search API Japanese Normalizer: helps improve search accuracy by normalizing text, including unifying hiragana and katakana and converting between full-width and half-width characters.
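The kind of normalization that module performs can be sketched with the standard library: Unicode NFKC folds full-width/half-width variants, and a fixed codepoint shift maps katakana onto hiragana (an illustrative sketch, not the module's actual code):

```python
import unicodedata

def normalize_ja(text: str) -> str:
    """Fold width variants (NFKC), then map katakana to hiragana."""
    text = unicodedata.normalize("NFKC", text)  # 'ﾍﾟﾝ' -> 'ペン', 'Ａ' -> 'A'
    # Katakana U+30A1..U+30F6 maps onto hiragana by subtracting 0x60.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

# Half-width katakana, full-width katakana, and hiragana spellings of
# 'pen' all normalize to the same form:
print({normalize_ja(s) for s in ("ﾍﾟﾝ", "ペン", "ぺん")})  # → {'ぺん'}
```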


Activity

  • Total releases: 5
  • First release: Feb 2025
  • Latest release: 1 year ago
  • Release cadence: 5 days
  • Stability: 0% stable


Releases

Version        Type         Release date
1.0.0-alpha4   Pre-release  Feb 20, 2025
1.0.0-alpha3   Pre-release  Feb 18, 2025
1.0.0-alpha2   Pre-release  Feb 3, 2025
1.0.0-alpha1   Pre-release  Feb 2, 2025
1.0.x-dev      Dev          Feb 2, 2025