search_api_japanese_normalizer
Search API Japanese Normalizer is a module that provides a processor for the Drupal Search API module. This processor standardizes variations in Japanese text, improving search accuracy.
Features
This module normalizes Japanese text variations according to the following rules:
- Convert full-width alphanumeric characters to half-width.
- Convert half-width Katakana to full-width Katakana.
- Replace characters that resemble the hyphen-minus with the hyphen-minus (-).
- Replace characters that resemble the long vowel mark with the long vowel mark (ー).
- Replace consecutive long vowel marks with a single one.
- Remove characters similar to the tilde (~).
- Convert full-width symbols commonly used in half-width form to half-width.
- Convert half-width symbols commonly used in full-width form to full-width.
- Convert full-width spaces to half-width spaces.
- Replace multiple consecutive half-width spaces with a single one.
- Remove half-width spaces between two Japanese characters, that is, any of Hiragana, full-width Katakana, half-width Katakana, Kanji, or full-width symbols.
- Remove half-width spaces between a Japanese character (as defined above) and a half-width alphanumeric character.
The rules are based on the normalization rules used by NEologd, a dictionary for morphological analyzers. For the detailed conversion rules, refer to the NEologd Normalization Rules.
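The rules listed above can be approximated in a few lines of Python. This is a minimal sketch for illustration only, not the module's actual (PHP) implementation: the function name `normalize_japanese`, the character sets, and the Unicode block ranges are all assumptions chosen for this example. It also simplifies the symbol rules by converting every full-width ASCII variant to half-width, whereas the NEologd rules keep a few symbols in full-width form.

```python
import re
import unicodedata

# Assumed character sets for this sketch (not taken from the module's source).
HYPHEN_LIKE = "\u02D7\u058A\u2010\u2011\u2012\u2013\u2043\u207B\u208B\u2212"
LONG_VOWEL_LIKE = "\uFE63\uFF70\u2014\u2015\u2500\u2501\u30FC"
TILDE_LIKE = "~\u223C\u223E\u301C\u3030\uFF5E"
JAPANESE = (
    "\u4E00-\u9FFF"  # Kanji (CJK Unified Ideographs)
    "\u3040-\u309F"  # Hiragana
    "\u30A0-\u30FF"  # full-width Katakana
    "\u3000-\u303F"  # CJK symbols and punctuation
    "\uFF00-\uFFEF"  # half-width and full-width forms
)
HALF_ALNUM = "0-9A-Za-z"


def normalize_japanese(text: str) -> str:
    """Apply a simplified version of the normalization rules (hypothetical helper)."""
    # Full-width ASCII variants -> half-width, half-width Katakana and
    # punctuation -> full-width. NFKC is applied to whole runs so that a
    # half-width Katakana plus its voicing mark composes into one character.
    text = re.sub(
        "[\uFF01-\uFF5E\uFF61-\uFF9F]+",
        lambda m: unicodedata.normalize("NFKC", m.group()),
        text,
    )
    # Hyphen-like characters -> hyphen-minus.
    text = re.sub(f"[{HYPHEN_LIKE}]", "-", text)
    # Long-vowel-like characters -> long vowel mark; runs collapse to one.
    text = re.sub(f"[{LONG_VOWEL_LIKE}]+", "ー", text)
    # Tilde-like characters are removed.
    text = re.sub(f"[{TILDE_LIKE}]", "", text)
    # Full-width spaces -> half-width, then collapse repeated spaces.
    text = text.replace("\u3000", " ")
    text = re.sub(" {2,}", " ", text)
    # Remove spaces between Japanese characters, and between Japanese
    # characters and half-width alphanumerics (in either order).
    text = re.sub(f"(?<=[{JAPANESE}]) (?=[{JAPANESE}{HALF_ALNUM}])", "", text)
    text = re.sub(f"(?<=[{HALF_ALNUM}]) (?=[{JAPANESE}])", "", text)
    return text


print(normalize_japanese("スーーパーーー"))   # スーパー
print(normalize_japanese("アルゴリズム C"))  # アルゴリズムC
```

Lookarounds are used for the space rules so that a sequence such as "あ い う" is handled in a single pass without consuming the neighboring characters.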
Example Conversions
Before → After
- ドルーパル → ドルーパル
- スーーパーーー → スーパー
- アルゴリズム C → アルゴリズムC
Post-Installation
After installation, the "Japanese Normalizer" processor appears in the "Processors" tab of the Search API index settings. Once enabled, the processor normalizes variations in Japanese text automatically, which improves search accuracy.
Additional Requirements
The Search API module is required for this module to function. For setup instructions, please refer to the Search API module documentation.
Recommended modules/libraries
- Search API Japanese Tokenizer: Optimizes search indexes using natural language processing and resolves issues related to N-grams.
Similar projects
- Search API Kana Convert: A module specializing in converting between Hiragana, Katakana, and Romaji representations.