Arabic search normalization: missing support for hamza variants, Ya/Kaf forms, and orthographic equivalence

Hello Discourse team,

We are running a multilingual forum with significant Arabic and Persian content, and we’ve encountered a critical limitation in the search functionality related to Arabic orthographic normalization.

:magnifying_glass_tilted_left: Problem Description

Arabic script includes multiple Unicode representations for semantically identical characters. Unfortunately, Discourse’s current search engine appears to treat these variants as distinct, which leads to incomplete or misleading search results.

Examples:

  • Searching for إطلاق مقامي only returns exact matches, while posts containing اطلاق مقامي, أطلاق مقامي, or إطلاق‌مقامي are excluded.
  • Similarly, searching for ي (U+064A) does not match ی (U+06CC), and ك (U+0643) fails to match ک (U+06A9), despite their functional equivalence in Arabic/Persian contexts.

This affects not only hamza variants (أ, إ, ء, ؤ, ئ) but also common substitutions like:

| Character | Unicode | Suggested Normalization |
|---|---|---|
| أ, إ, ء, آ | U+0623, U+0625, U+0621, U+0622 | Normalize to ا |
| ؤ | U+0624 | Normalize to و |
| ئ | U+0626 | Normalize to ي |
| ى | U+0649 | Normalize to ي |
| ة | U+0629 | Normalize to ه |
| ي vs ی | U+064A vs U+06CC | Normalize to ی |
| ك vs ک | U+0643 vs U+06A9 | Normalize to ک |

This issue is compounded when users omit diacritics or use different keyboard layouts, resulting in fragmented search behavior.


:gear: Proposed Solution

We recommend implementing a Unicode-aware normalization layer during both indexing and query parsing. This can be achieved by:

  1. Preprocessing both indexed content and user queries to unify character variants.
  2. Applying normalization rules similar to those used in Arabic NLP libraries or search engines (e.g., Farasa, Hazm, or custom regex-based mappers).
  3. Optionally, supporting fuzzy matching or Levenshtein distance for near-exact matches.

Here’s a simplified example of a normalization function (Java-style):

public static String normalizeArabic(String text) {
  // Chained single-character replacements. Order matters: ئ and ى are first
  // folded to Arabic ي, which the later rule then converts to Persian ی.
  return text.replace("أ", "ا")
             .replace("إ", "ا")
             .replace("آ", "ا")
             .replace("ؤ", "و")
             .replace("ئ", "ي")
             .replace("ى", "ي")
             .replace("ة", "ه")
             .replace("ي", "ی")
             .replace("ك", "ک");
}

:folded_hands: Request

Could this normalization be considered for inclusion in the core search engine or as a plugin? It would significantly improve usability for Arabic and Persian communities using Discourse.

If there’s an existing workaround or plugin that addresses this, we’d appreciate any guidance.

Thank you for your time and for building such a powerful platform.

Best regards

1 Like

Does the search_ignore_accents site setting have any effect on this problem?

1 Like

Thanks so much for jumping in and contributing to the discussion.

To answer your question: yes, the search_ignore_accents setting is enabled on our forum.
Unfortunately, it doesn’t resolve the issue we’re facing. The search results still fail to match orthographically equivalent Arabic and Persian characters, so the problem persists despite that setting.

1 Like

I think this is a reasonable request since it would greatly improve the search experience for Arabic and Persian sites. We’d love to review a PR that implements this feature, so I’m going to put a pr-welcome on it.

For anyone who decides to work on this feature: all the normalization logic should be gated behind a site setting that is enabled by default for Arabic and Persian sites (see locale_default in site_settings.yml) and off by default for all other locales. Core already has similar normalization logic for accented characters (see lib/search.rb), so that would be a useful reference when implementing this feature.
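
For reference, a minimal sketch of the gating idea (the setting name and helper below are hypothetical, not existing Discourse code; the setting itself would be declared in config/site_settings.yml with locale_default entries for ar and fa):

ARABIC_PERSIAN_MAP = {
  "ك" => "ک", "ي" => "ی", "ى" => "ی",
  "أ" => "ا", "إ" => "ا", "ؤ" => "و", "ئ" => "ی",
}.freeze
ARABIC_PERSIAN_PATTERN = Regexp.union(ARABIC_PERSIAN_MAP.keys)

def normalize_for_search(text)
  # Hypothetical site setting; off everywhere except ar/fa locale defaults.
  return text unless SiteSetting.search_arabic_persian_normalization
  text.unicode_normalize(:nfkc).gsub(ARABIC_PERSIAN_PATTERN, ARABIC_PERSIAN_MAP)
end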

4 Likes

Thank you so much, Osama! I’m really glad to see this suggestion was well received.

2 Likes

For this part of the problem, are we talking about a standard Unicode normalisation such as NFKC (to pick one of the forms)?

(I’m not even sure what we do… I assume we normalise post text in the cooking pipeline?)

1 Like

I’m not a technical expert, but I’ve been researching this issue because I want to ensure that no search queries are missed on a Persian-Arabic bilingual Discourse forum. Since Discourse uses PostgreSQL, normalization becomes essential: a user might search using Persian characters, while the same word is stored using Arabic ones—or vice versa. Without proper normalization, the search will fail.
Based on what I’ve learned, using Unicode NFKC normalization is a solid starting point: it handles many compatibility cases such as ligatures and presentation forms. (Note that Arabic/Persian digits are not folded to ASCII by NFKC, since they have no compatibility decomposition, so digits need an explicit mapping of their own.)

However, for Persian and Arabic text, NFKC alone is not sufficient. It does not normalize several critical character variants that are visually and semantically equivalent but differ at the binary level.
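
A quick way to see both points in Ruby (String#unicode_normalize is in the standard library):

"ﻻ".unicode_normalize(:nfkc)  # => "لا"  presentation-form ligature is decomposed
"ك".unicode_normalize(:nfkc)  # => "ك"   Arabic Kaf is not folded to Persian ک
"٤".unicode_normalize(:nfkc)  # => "٤"   Arabic-Indic digit is not folded to ASCII 4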

Below, I’m outlining the procedures and insights I’ve arrived at through my research and exploration.


:wrench: Overall Design Strategy

  1. Apply Unicode NFKC normalization first to handle ligatures and presentation forms (digits are not unified by NFKC, so they are handled by the custom mappings).
  2. Then apply custom character mappings in a defined order (e.g., normalize Hamza variants before Arabic Ya; a short illustration follows this list).
  3. Separate normalization policies for storage vs. search:
    • Use a Conservative profile for canonical storage (preserve ZWNJ, avoid semantic shifts).
    • Use a Permissive profile for search (ignore ZWNJ, unify Hamza variants, normalize digits).
  4. All mappings should be configurable via a centralized mapping table in the database or a Ruby hash in the application.
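
To make the ordering point in step 2 concrete, here is a small Ruby illustration using the same two-step chain as the Java-style example above:

"رئيس".gsub("ي", "ی").gsub("ئ", "ي")  # => "ريیس"  folding Arabic Ya first leaves a stray Arabic Ya behind
"رئيس".gsub("ئ", "ي").gsub("ي", "ی")  # => "رییس"  folding the hamza form first yields a fully normalized string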

:one: Normalization Profiles

:green_circle: Conservative (for storage)

  • Minimal transformation
  • Apply NFKC
  • Normalize Arabic Kaf/Ya to Persian equivalents
  • Remove diacritics
  • Preserve ZWNJ
  • Store as original_text + normalized_conservative

:blue_circle: Permissive (for search)

  • Aggressive matching
  • Apply all Conservative rules
  • Remove/ignore ZWNJ
  • Normalize Hamza variants to base letters
  • Convert all digits to ASCII
  • Optionally unify Taa Marbuta → Heh
  • Used for query preprocessing (see the combined Ruby sketch after this list)
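
Putting the two profiles together, a minimal Ruby sketch (module and method names are illustrative, not existing Discourse code, and the mapping tables are deliberately abbreviated):

module SearchTextNormalizer
  # Conservative profile: only the folds that are safe for canonical storage.
  CONSERVATIVE_MAP = { "ك" => "ک", "ي" => "ی", "ى" => "ی" }.freeze

  # Permissive profile: adds the aggressive folds used only for matching.
  PERMISSIVE_MAP = CONSERVATIVE_MAP.merge(
    "أ" => "ا", "إ" => "ا", "ٱ" => "ا", "ؤ" => "و", "ئ" => "ی",
    "ة" => "ه", "٤" => "4", "۴" => "4"
  ).freeze

  HARAKAT = /[\u064B-\u0652]/

  def self.normalize(text, profile: :conservative)
    map = profile == :permissive ? PERMISSIVE_MAP : CONSERVATIVE_MAP
    out = text.unicode_normalize(:nfkc)
    out = out.gsub(Regexp.union(map.keys), map)
    out = out.gsub(HARAKAT, "")                            # drop diacritics in both profiles
    out = out.delete("\u200D")                             # ZWJ is always removed
    out = out.delete("\u200C") if profile == :permissive   # ZWNJ survives only in Conservative
    out.gsub(/\s+/, " ").strip
  end
end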

:two: Comprehensive Mapping Table

| Source | Target | Unicode | Notes |
|---|---|---|---|
| ك | ک | U+0643 → U+06A9 | Arabic Kaf → Persian Kaf |
| ي | ی | U+064A → U+06CC | Arabic Ya → Persian Ya |
| ى | ی | U+0649 → U+06CC | Final Ya variant |
| أ, إ, ٱ | ا | Various → U+0627 | Hamza forms → Alef |
| ؤ | و | U+0624 → U+0648 | Hamza Waw |
| ئ | ی | U+0626 → U+06CC | Hamza Ya |
| ء | | U+0621 | Remove or preserve (configurable) |
| ة | ه | U+0629 → U+0647 | Taa Marbuta → Heh (optional) |
| ۀ | هٔ | U+06C0 ↔ U+0647 + U+0654 | Normalize composed form |
| ڭ | گ | U+06AD → U+06AF | Regional variants |
| ZWNJ | | U+200C | Preserve in Conservative, remove in Permissive |
| ٤, ۴ | 4 | U+0664, U+06F4 → ASCII | Normalize digits |
| Diacritics | | U+064B–U+0652 | Remove all harakat |
| ZWJ | | U+200D | Remove invisible joiners |
| Multiple spaces | Single space | | Normalize spacing |

:three: Fast Mapping Snippet (for SQL or Ruby)

ك → ک
ي → ی
ى → ی
أ → ا
إ → ا
ؤ → و
ئ → ی
ة → ه
ۀ → هٔ
ٱ → ا
٤, ۴ → 4
ZWNJ (U+200C) → (removed in permissive)
Harakat (U+064B..U+0652) → removed
ZWJ (U+200D) → removed

:four: Implementation in PostgreSQL

  • Create a text_normalization_map table
  • Use regexp_replace or TRANSLATE chains for performance (see the short sketch after this list)
  • Optionally implement in PL/Python or PL/v8 for Unicode support
  • Normalize both stored content and incoming queries using the same logic
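
As a rough sketch of the TRANSLATE idea from an ActiveRecord call site (column choice and the naive ILIKE match are purely illustrative; a real implementation would query a pre-normalized, indexed column):

# translate() is a one-pass character-for-character substitution, so it only covers
# the single-character mappings; multi-character rules still need regexp_replace.
FROM_CHARS = "كيىأإٱؤئة"
TO_CHARS   = "کییاااویه"

# raw_term stands in for whatever the user typed; it is folded in Ruby with the
# same rules (see the SearchTextNormalizer sketch above).
term = SearchTextNormalizer.normalize(raw_term, profile: :permissive)
Post.where("translate(raw, ?, ?) ILIKE ?", FROM_CHARS, TO_CHARS, "%#{term}%")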

Indexing Strategy

  • Store normalized_conservative for canonical indexing
  • Normalize queries with normalize_persian_arabic(query, 'permissive')
  • If using permissive search, index must match the same profile
  • Optionally store both versions for cross-compare

:five: Ruby Hash Example (for Discourse)

NORMALIZATION_MAP = {
  "ك" => "ک",
  "ي" => "ی",
  "ى" => "ی",
  "أ" => "ا",
  "إ" => "ا",
  "ٱ" => "ا",
  "ؤ" => "و",
  "ئ" => "ی",
  "ة" => "ه",
  "ۀ" => "هٔ",
  "۴" => "4",
  "٤" => "4",
  "\u200C" => "", # ZWNJ
  "\u200D" => "", # ZWJ
}
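
For illustration, a map like this can be applied in a single pass:

NORMALIZATION_PATTERN = Regexp.union(NORMALIZATION_MAP.keys)

def normalize_text(text)
  # NFKC first (ligatures, presentation forms), then the custom character map.
  text.unicode_normalize(:nfkc).gsub(NORMALIZATION_PATTERN, NORMALIZATION_MAP)
end

normalize_text("كيف")  # => "کیف"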

:six: Performance and Practical Notes

  1. Apply NFKC in the application layer (e.g., Ruby unicode_normalize(:nfkc))
  2. Use separate indexes for conservative vs. permissive profiles
  3. Avoid forced mappings of semantically sensitive characters (e.g., Hamza, Taa Marbuta) unless explicitly configured
  4. Run A/B tests on real forum data to measure hit rate and false positives
  5. Document each mapping with rationale and examples
  6. Define unit tests in both Ruby and SQL for each mapping
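
As a hypothetical sketch for point 6, an RSpec-style test against the SearchTextNormalizer helper sketched earlier:

describe "Persian/Arabic search normalization" do
  it "unifies Arabic Kaf and Ya with their Persian counterparts" do
    expect(SearchTextNormalizer.normalize("كتاب", profile: :conservative)).to eq("کتاب")
  end

  it "removes ZWNJ only in the permissive profile" do
    expect(SearchTextNormalizer.normalize("می\u200Cشود", profile: :conservative)).to include("\u200C")
    expect(SearchTextNormalizer.normalize("می\u200Cشود", profile: :permissive)).to eq("میشود")
  end
end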

:seven: Final Recommendation

  • Use Unicode NFKC as a base
  • Extend it with a custom mapping layer
  • Maintain dual profiles for storage and search
  • Implement normalization in both the app and database layers
  • Document and test every mapping
  • Build appropriate indexes (GIN + to_tsvector) on normalized columns
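
For the last bullet, a hypothetical Rails-style migration sketch (it assumes a posts.normalized_conservative column exists, as described above):

class AddNormalizedSearchIndex < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!

  def up
    execute <<~SQL
      CREATE INDEX CONCURRENTLY IF NOT EXISTS index_posts_on_normalized_conservative
      ON posts USING gin (to_tsvector('simple', normalized_conservative))
    SQL
  end

  def down
    execute "DROP INDEX IF EXISTS index_posts_on_normalized_conservative"
  end
end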