Arabic search normalization: missing support for hamza variants, Ya/Kaf forms, and orthographic equivalence

Hello Discourse team,

We are running a multilingual forum with significant Arabic and Persian content, and we’ve encountered a critical limitation in the search functionality related to Arabic orthographic normalization.

:magnifying_glass_tilted_left: Problem Description

Arabic script includes multiple Unicode representations for semantically identical characters. Unfortunately, Discourse’s current search engine appears to treat these variants as distinct, which leads to incomplete or misleading search results.

Examples:

  • Searching for إطلاق مقامي only returns exact matches, while posts containing اطلاق مقامي, أطلاق مقامي, or إطلاق‌مقامي are excluded.
  • Similarly, searching for ي (U+064A) does not match ی (U+06CC), and ك (U+0643) fails to match ک (U+06A9), despite their functional equivalence in Arabic/Persian contexts.

This affects not only hamza variants (أ, إ, ء, ؤ, ئ) but also common substitutions like:

| Character | Unicode | Suggested Normalization |
|---|---|---|
| أ, إ, ء, آ | U+0623, U+0625, U+0621, U+0622 | Normalize to ا |
| ؤ | U+0624 | Normalize to و |
| ئ | U+0626 | Normalize to ي |
| ى | U+0649 | Normalize to ي |
| ة | U+0629 | Normalize to ه |
| ي vs ی | U+064A vs U+06CC | Normalize to ی |
| ك vs ک | U+0643 vs U+06A9 | Normalize to ک |

This issue is compounded when users omit diacritics or use different keyboard layouts, resulting in fragmented search behavior.


:gear: Proposed Solution

We recommend implementing a Unicode-aware normalization layer during both indexing and query parsing. This can be achieved by:

  1. Preprocessing both indexed content and user queries to unify character variants.
  2. Applying normalization rules similar to those used in Arabic NLP libraries or search engines (e.g., Farasa, Hazm, or custom regex-based mappers).
  3. Optionally, supporting fuzzy matching or Levenshtein distance for near-exact matches.

Here’s a simplified example of a normalization function (Java-style):

public static String normalizeArabic(String text) {
  // Chained single-character replacements. Order matters: ئ and ى are first
  // folded to Arabic ي, which the later rule then converts to Persian ی.
  return text.replace("أ", "ا")
             .replace("إ", "ا")
             .replace("آ", "ا")
             .replace("ؤ", "و")
             .replace("ئ", "ي")
             .replace("ى", "ي")
             .replace("ة", "ه")
             .replace("ي", "ی")
             .replace("ك", "ک");
}

:folded_hands: Request

Could this normalization be considered for inclusion in the core search engine or as a plugin? It would significantly improve usability for Arabic and Persian communities using Discourse.

If there’s an existing workaround or plugin that addresses this, we’d appreciate any guidance.

Thank you for your time and for building such a powerful platform.

Best regards

1 Like

Does the search_ignore_accents site setting have any effect on this problem?

1 Like

Thanks so much for jumping in and contributing to the discussion.

To answer your question: yes, the search_ignore_accents setting is enabled on our forum.
Unfortunately, it doesn’t resolve the issue we’re facing. The search results still fail to match orthographically equivalent Arabic and Persian characters, so the problem persists despite that setting.

1 Like

I think this is a reasonable request since it would greatly improve the search experience for Arabic and Persian sites. We’d love to review a PR that implements this feature, so I’m going to put a pr-welcome on it.

For anyone who decides to work on this feature: all the normalization logic should be gated behind a site setting that is enabled by default for Arabic and Persian sites (see locale_default in site_settings.yml) and off by default for all other locales. Core already has similar normalization logic for accented characters (see lib/search.rb), so that would be a useful reference when implementing this feature.
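
For reference, a minimal sketch of the gating idea (the setting name and helper below are hypothetical, not existing Discourse code; the setting itself would be declared in config/site_settings.yml with locale_default entries for ar and fa):

ARABIC_PERSIAN_MAP = {
  "ك" => "ک", "ي" => "ی", "ى" => "ی",
  "أ" => "ا", "إ" => "ا", "ؤ" => "و", "ئ" => "ی",
}.freeze
ARABIC_PERSIAN_PATTERN = Regexp.union(ARABIC_PERSIAN_MAP.keys)

def normalize_for_search(text)
  # Hypothetical site setting; off everywhere except ar/fa locale defaults.
  return text unless SiteSetting.search_arabic_persian_normalization
  text.unicode_normalize(:nfkc).gsub(ARABIC_PERSIAN_PATTERN, ARABIC_PERSIAN_MAP)
end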

4 Likes

Thank you so much, Osama! I’m really glad to see this suggestion was well received.

2 Likes

For this part of the problem, are we talking about a standard Unicode normalisation such as NFKC (to pick one of the forms)?

(I’m not even sure what we do… I assume we normalise post text in the cooking pipeline?)

1 Like

I’m not a technical expert, but I’ve been researching this issue because I want to ensure that no search queries are missed on a Persian-Arabic bilingual Discourse forum. Since Discourse uses PostgreSQL, normalization becomes essential: a user might search using Persian characters, while the same word is stored using Arabic ones—or vice versa. Without proper normalization, the search will fail.
Based on what I’ve learned, using Unicode NFKC normalization is a solid starting point: it handles many compatibility cases such as ligatures and presentation forms. (Note that Arabic/Persian digits are not folded to ASCII by NFKC, since they have no compatibility decomposition, so digits need an explicit mapping of their own.)

However, for Persian and Arabic text, NFKC alone is not sufficient. It does not normalize several critical character variants that are visually and semantically equivalent but differ at the binary level.
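
A quick way to see both points in Ruby (String#unicode_normalize is in the standard library):

"ﻻ".unicode_normalize(:nfkc)  # => "لا"  presentation-form ligature is decomposed
"ك".unicode_normalize(:nfkc)  # => "ك"   Arabic Kaf is not folded to Persian ک
"٤".unicode_normalize(:nfkc)  # => "٤"   Arabic-Indic digit is not folded to ASCII 4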

Below, I’m outlining the procedures and insights I’ve arrived at through my research and exploration.


:wrench: Overall Design Strategy

  1. Apply Unicode NFKC normalization first to handle ligatures and presentation forms (digits are not unified by NFKC, so they are handled by the custom mappings).
  2. Then apply custom character mappings in a defined order (e.g., normalize Hamza variants before Arabic Ya; a short illustration follows this list).
  3. Separate normalization policies for storage vs. search:
    • Use a Conservative profile for canonical storage (preserve ZWNJ, avoid semantic shifts).
    • Use a Permissive profile for search (ignore ZWNJ, unify Hamza variants, normalize digits).
  4. All mappings should be configurable via a centralized mapping table in the database or a Ruby hash in the application.
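
To make the ordering point in step 2 concrete, here is a small Ruby illustration using the same two-step chain as the Java-style example above:

"رئيس".gsub("ي", "ی").gsub("ئ", "ي")  # => "ريیس"  folding Arabic Ya first leaves a stray Arabic Ya behind
"رئيس".gsub("ئ", "ي").gsub("ي", "ی")  # => "رییس"  folding the hamza form first yields a fully normalized string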

:one: Normalization Profiles

:green_circle: Conservative (for storage)

  • Minimal transformation
  • Apply NFKC
  • Normalize Arabic Kaf/Ya to Persian equivalents
  • Remove diacritics
  • Preserve ZWNJ
  • Store as original_text + normalized_conservative

:blue_circle: Permissive (for search)

  • Aggressive matching
  • Apply all Conservative rules
  • Remove/ignore ZWNJ
  • Normalize Hamza variants to base letters
  • Convert all digits to ASCII
  • Optionally unify Taa Marbuta → Heh
  • Used for query preprocessing (see the combined Ruby sketch after this list)
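
Putting the two profiles together, a minimal Ruby sketch (module and method names are illustrative, not existing Discourse code, and the mapping tables are deliberately abbreviated):

module SearchTextNormalizer
  # Conservative profile: only the folds that are safe for canonical storage.
  CONSERVATIVE_MAP = { "ك" => "ک", "ي" => "ی", "ى" => "ی" }.freeze

  # Permissive profile: adds the aggressive folds used only for matching.
  PERMISSIVE_MAP = CONSERVATIVE_MAP.merge(
    "أ" => "ا", "إ" => "ا", "ٱ" => "ا", "ؤ" => "و", "ئ" => "ی",
    "ة" => "ه", "٤" => "4", "۴" => "4"
  ).freeze

  HARAKAT = /[\u064B-\u0652]/

  def self.normalize(text, profile: :conservative)
    map = profile == :permissive ? PERMISSIVE_MAP : CONSERVATIVE_MAP
    out = text.unicode_normalize(:nfkc)
    out = out.gsub(Regexp.union(map.keys), map)
    out = out.gsub(HARAKAT, "")                            # drop diacritics in both profiles
    out = out.delete("\u200D")                             # ZWJ is always removed
    out = out.delete("\u200C") if profile == :permissive   # ZWNJ survives only in Conservative
    out.gsub(/\s+/, " ").strip
  end
end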

:two: Comprehensive Mapping Table

| Source | Target | Unicode | Notes |
|---|---|---|---|
| ك | ک | U+0643 → U+06A9 | Arabic Kaf → Persian Kaf |
| ي | ی | U+064A → U+06CC | Arabic Ya → Persian Ya |
| ى | ی | U+0649 → U+06CC | Final Ya variant |
| أ, إ, ٱ | ا | Various → U+0627 | Hamza forms → Alef |
| ؤ | و | U+0624 → U+0648 | Hamza Waw |
| ئ | ی | U+0626 → U+06CC | Hamza Ya |
| ء | | U+0621 | Remove or preserve (configurable) |
| ة | ه | U+0629 → U+0647 | Taa Marbuta → Heh (optional) |
| ۀ | هٔ | U+06C0 ↔ U+0647 + U+0654 | Normalize composed form |
| ڭ | گ | U+06AD → U+06AF | Regional variants |
| ZWNJ | | U+200C | Preserve in Conservative, remove in Permissive |
| ٤, ۴ | 4 | U+0664, U+06F4 → ASCII | Normalize digits |
| Diacritics | | U+064B–U+0652 | Remove all harakat |
| ZWJ | | U+200D | Remove invisible joiners |
| Multiple spaces | Single space | | Normalize spacing |

:three: Fast Mapping Snippet (for SQL or Ruby)

ك → ک
ي → ی
ى → ی
أ → ا
إ → ا
ؤ → و
ئ → ی
ة → ه
ۀ → هٔ
ٱ → ا
٤, ۴ → 4
ZWNJ (U+200C) → (removed in permissive)
Harakat (U+064B..U+0652) → removed
ZWJ (U+200D) → removed

:four: Implementation in PostgreSQL

  • Create a text_normalization_map table
  • Use regexp_replace or TRANSLATE chains for performance (see the short sketch after this list)
  • Optionally implement in PL/Python or PL/v8 for Unicode support
  • Normalize both stored content and incoming queries using the same logic
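
As a rough sketch of the TRANSLATE idea from an ActiveRecord call site (column choice and the naive ILIKE match are purely illustrative; a real implementation would query a pre-normalized, indexed column):

# translate() is a one-pass character-for-character substitution, so it only covers
# the single-character mappings; multi-character rules still need regexp_replace.
FROM_CHARS = "كيىأإٱؤئة"
TO_CHARS   = "کییاااویه"

# raw_term stands in for whatever the user typed; it is folded in Ruby with the
# same rules (see the SearchTextNormalizer sketch above).
term = SearchTextNormalizer.normalize(raw_term, profile: :permissive)
Post.where("translate(raw, ?, ?) ILIKE ?", FROM_CHARS, TO_CHARS, "%#{term}%")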

Indexing Strategy

  • Store normalized_conservative for canonical indexing
  • Normalize queries with normalize_persian_arabic(query, 'permissive')
  • If using permissive search, index must match the same profile
  • Optionally store both versions for cross-compare

:five: Ruby Hash Example (for Discourse)

NORMALIZATION_MAP = {
  "ك" => "ک",
  "ي" => "ی",
  "ى" => "ی",
  "أ" => "ا",
  "إ" => "ا",
  "ٱ" => "ا",
  "ؤ" => "و",
  "ئ" => "ی",
  "ة" => "ه",
  "ۀ" => "هٔ",
  "۴" => "4",
  "٤" => "4",
  "\u200C" => "", # ZWNJ
  "\u200D" => "", # ZWJ
}
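
For illustration, a map like this can be applied in a single pass:

NORMALIZATION_PATTERN = Regexp.union(NORMALIZATION_MAP.keys)

def normalize_text(text)
  # NFKC first (ligatures, presentation forms), then the custom character map.
  text.unicode_normalize(:nfkc).gsub(NORMALIZATION_PATTERN, NORMALIZATION_MAP)
end

normalize_text("كيف")  # => "کیف"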

:six: Performance and Practical Notes

  1. Apply NFKC in the application layer (e.g., Ruby unicode_normalize(:nfkc))
  2. Use separate indexes for conservative vs. permissive profiles
  3. Avoid forced mappings of semantically sensitive characters (e.g., Hamza, Taa Marbuta) unless explicitly configured
  4. Run A/B tests on real forum data to measure hit rate and false positives
  5. Document each mapping with rationale and examples
  6. Define unit tests in both Ruby and SQL for each mapping
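
As a hypothetical sketch for point 6, an RSpec-style test against the SearchTextNormalizer helper sketched earlier:

describe "Persian/Arabic search normalization" do
  it "unifies Arabic Kaf and Ya with their Persian counterparts" do
    expect(SearchTextNormalizer.normalize("كتاب", profile: :conservative)).to eq("کتاب")
  end

  it "removes ZWNJ only in the permissive profile" do
    expect(SearchTextNormalizer.normalize("می\u200Cشود", profile: :conservative)).to include("\u200C")
    expect(SearchTextNormalizer.normalize("می\u200Cشود", profile: :permissive)).to eq("میشود")
  end
end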

:seven: Final Recommendation

  • Use Unicode NFKC as a base
  • Extend it with a custom mapping layer
  • Maintain dual profiles for storage and search
  • Implement normalization in both the app and database layers
  • Document and test every mapping
  • Build appropriate indexes (GIN + to_tsvector) on normalized columns
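
For the last bullet, a hypothetical Rails-style migration sketch (it assumes a posts.normalized_conservative column exists, as described above):

class AddNormalizedSearchIndex < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!

  def up
    execute <<~SQL
      CREATE INDEX CONCURRENTLY IF NOT EXISTS index_posts_on_normalized_conservative
      ON posts USING gin (to_tsvector('simple', normalized_conservative))
    SQL
  end

  def down
    execute "DROP INDEX IF EXISTS index_posts_on_normalized_conservative"
  end
end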