search_api_japanese_tokenizer
Search API Japanese Tokenizer is a Drupal module that segments and indexes Japanese text at the word level. By default, Drupal's standard search and the Search API module use N-gram segmentation, which can be imprecise for Japanese. This module improves search accuracy through word-level natural language processing, without requiring external search engines such as Apache Solr or Elasticsearch.
Japanese differs from English and many other languages in that it has no spaces between words. In English, "This is a pen." clearly separates words with spaces, making tokenization straightforward. The Japanese equivalent, "これはペンです。", contains no spaces, so determining word boundaries is difficult. A specialized natural language processing technique is therefore required to segment the text properly.
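The difference is easy to see with naive whitespace splitting (a minimal Python illustration, not part of the module):

```python
# English: splitting on whitespace yields one token per word.
english = "This is a pen."
print(english.split())   # ['This', 'is', 'a', 'pen.']

# Japanese: the same sentence has no spaces, so whitespace
# splitting returns the entire sentence as a single token.
japanese = "これはペンです。"
print(japanese.split())  # ['これはペンです。']
```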
By default, Drupal's standard search and the Search API module use N-gram segmentation, which has the following issues:
- It is practically impossible to index and search for single-character words.
- Unintended matches may appear in search results.
- It only supports exact phrase matching.
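The first two issues can be sketched with a naive character-bigram segmenter (a hypothetical Python illustration; Drupal's actual N-gram processor is configurable and implemented in PHP):

```python
def bigrams(text: str) -> list[str]:
    """Naive character-bigram segmentation (illustrative sketch only)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# "東京都" (Tokyo Metropolis) produces the bigram "京都" (Kyoto),
# so a bigram index matches a search for 京都 against 東京都 —
# an example of an unintended match.
print(bigrams("東京都"))  # ['東京', '京都']

# A single-character word such as "木" (tree) yields no bigrams
# at all, so it cannot be indexed or found.
print(bigrams("木"))      # []
```

A word-level tokenizer would instead split 東京都 into meaningful units, so a search for 京都 would not match it.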
The Search API Japanese Tokenizer module addresses these issues, enhancing Drupal’s search capabilities and improving the accuracy of Japanese search queries without relying on external search engines.
Features
- Enables indexing and searching at the single-character level
- Improves search accuracy by indexing at the word level
- Uses machine learning-based tokenization
- Supports index exclusion based on character type (TinySegmenter only)
- Resolves variations in spelling when using morphological analysis
- Allows exclusion of index entries based on part of speech when using morphological analysis
Post-Installation
- Enable the Search API module
  This module requires the Search API module to be enabled.
- Install the module
  Enable the `search_api_japanese_tokenizer` module.
- Configure the search server and index
  Go to `/admin/config/search/search-api` and configure the search server and search index.
- Select the tokenizer
  In the Search API processor settings, choose one of the following tokenizers:
  - TinySegmenter tokenizer
  - MeCab tokenizer
  - Sudachi tokenizer
- Disable the default Tokenizer processor
  The default Tokenizer processor should be disabled, as enabling it may cause incorrect indexing.
Additional Requirements
This module requires the following components:
- Search API module
- Optional morphological analysis engines:
- TinySegmenter (no additional installation required)
- MeCab (must be installed on the server)
- Sudachi (must be installed on the server; since it is written in Java, a JRE is required)
If using MeCab or Sudachi, they must be installed on the server beforehand.
Recommended modules/libraries
- Search API Japanese Normalizer: helps improve search accuracy by normalizing text, including unifying hiragana and katakana and converting between full-width and half-width characters.
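As an illustration of the kinds of normalization described above: Unicode NFKC normalization unifies full-width and half-width forms, and hiragana and katakana are separated by a fixed code-point offset (a Python sketch of the general technique; the module itself is PHP and its exact behavior may differ):

```python
import unicodedata

# Half-width katakana is folded to full-width by NFKC normalization.
print(unicodedata.normalize("NFKC", "ﾍﾟﾝ"))          # ペン

# Full-width Latin letters and digits are folded to half-width.
print(unicodedata.normalize("NFKC", "Ｄｒｕｐａｌ１０"))  # Drupal10

def hira_to_kata(text: str) -> str:
    """Convert hiragana to katakana; the scripts are offset by 0x60."""
    return "".join(
        chr(ord(ch) + 0x60) if "ぁ" <= ch <= "ゖ" else ch
        for ch in text
    )

print(hira_to_kata("ぺん"))  # ペン
```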