
Search API Japanese Tokenizer is a Drupal module that segments and indexes Japanese text at the word level. By default, Drupal's standard search and the Search API module use N-gram segmentation, which can be imprecise for Japanese. This module improves search accuracy using natural language processing, without requiring external search engines such as Apache Solr or Elasticsearch.

Japanese differs from English and many other languages in that there are no spaces between words. In English, the sentence "This is a pen." clearly separates words with spaces, making tokenization straightforward. The Japanese equivalent, "これはペンです。", contains no spaces, so word boundaries must be inferred from the text itself. Segmenting Japanese properly therefore requires a specialized natural language processing technique.
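The difference is easy to see with a trivial whitespace tokenizer (plain Python, purely for illustration):

```python
# Naive whitespace tokenization: works for English, useless for Japanese.
english = "This is a pen."
japanese = "これはペンです。"

print(english.split())   # ['This', 'is', 'a', 'pen.'] -- four tokens
print(japanese.split())  # ['これはペンです。'] -- one unsplittable token
```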

By default, Drupal's standard search and the Search API module use N-gram segmentation, which has the following issues:

  • It is practically impossible to index and search for single-character words.
  • Unrelated content may appear in search results (false positives).
  • It only supports exact phrase matching.
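To see where these issues come from, here is a minimal character-bigram indexer in plain Python (an illustration of the N-gram approach in general, not Drupal's actual implementation):

```python
def ngrams(text, n=2):
    """Return all overlapping character n-grams of `text` (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# 木がある ("there is a tree"): the single-character word 木 (tree)
# is absorbed into the surrounding bigrams, so an exact bigram lookup
# can never find it on its own.
index = ngrams("木がある")
print(index)          # ['木が', 'があ', 'ある']
print("木" in index)  # False
```

Because every adjacent character pair is indexed regardless of word boundaries, queries can also match accidental pairs spanning two words, which is where the false positives come from.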

The Search API Japanese Tokenizer module addresses these issues, enhancing Drupal’s search capabilities and improving the accuracy of Japanese search queries without relying on external search engines.

Features

  • Enables indexing and searching at the single-character level
  • Improves search accuracy by indexing at the word level
  • Uses machine learning-based tokenization
  • Supports index exclusion based on character type (TinySegmenter only)
  • Resolves variations in spelling when using morphological analysis
  • Allows exclusion of index entries based on part of speech when using morphological analysis

Post-Installation

  1. Enable the Search API module

    This module requires the Search API module to be enabled.
  2. Install the module

    Enable the search_api_japanese_tokenizer module.
  3. Configure the search server and index

    Go to /admin/config/search/search-api and configure the search server and search index.
  4. Select the tokenizer

    In the Search API processor settings, choose one of the following tokenizers:
    • TinySegmenter tokenizer
    • MeCab tokenizer
    • Sudachi tokenizer
  5. Disable the default Tokenizer processor

    The default Tokenizer processor should be disabled, as enabling it may cause incorrect indexing.

Additional Requirements

This module requires the following components:

  • Search API module
  • Optional morphological analysis engines:
    • TinySegmenter (no additional installation required)
    • MeCab (must be installed on the server)
    • Sudachi (must be installed on the server; since it is written in Java, a JRE is required)

If using MeCab or Sudachi, install the engine on the server before configuring the module.

The following companion module is also recommended:

  • Search API Japanese Normalizer: improves search accuracy by normalizing text, for example unifying hiragana and katakana and converting between full-width and half-width characters.
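The kind of normalization involved can be sketched with the Python standard library (an illustration of the concept, not the Normalizer module's actual code):

```python
import unicodedata

def normalize_ja(text: str) -> str:
    """Unify width variants and fold katakana to hiragana."""
    # NFKC maps full-width Latin letters and digits to half-width,
    # and half-width katakana to full-width katakana.
    text = unicodedata.normalize("NFKC", text)
    # The katakana and hiragana Unicode blocks are offset by 0x60
    # code points, so katakana can be folded with a simple shift.
    return "".join(
        chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c
        for c in text
    )

print(normalize_ja("Ｄｒｕｐａｌ"))  # Drupal
print(normalize_ja("ペン"))          # ぺん
```

With both forms folded to a single representation at index time and query time, "ペン" and "ぺん" match the same documents.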

Activity

  Total releases: 5
  First release: Feb 2025
  Latest release: Feb 2025
  Release cadence: 5 days
  Stability: 0% stable (no stable release yet)

Releases

  Version        Type         Release date
  1.0.0-alpha4   Pre-release  Feb 20, 2025
  1.0.0-alpha3   Pre-release  Feb 18, 2025
  1.0.0-alpha2   Pre-release  Feb 3, 2025
  1.0.0-alpha1   Pre-release  Feb 2, 2025
  1.0.x-dev      Dev          Feb 2, 2025