Hello Discourse team,
We are running a multilingual forum with significant Arabic and Persian content, and we’ve encountered a critical limitation in the search functionality related to Arabic orthographic normalization.
Problem Description
Arabic script includes multiple Unicode representations for semantically identical characters. Unfortunately, Discourse’s current search engine appears to treat these variants as distinct, which leads to incomplete or misleading search results.
Examples:
- Searching for `إطلاق مقامي` only returns exact matches, while posts containing `اطلاق مقامي`, `أطلاق مقامي`, or `إطلاقمقامي` are excluded.
- Similarly, searching for `ي` (U+064A) does not match `ی` (U+06CC), and `ك` (U+0643) fails to match `ک` (U+06A9), despite their functional equivalence in Arabic/Persian contexts.
This affects not only hamza variants (`أ`, `إ`, `ء`, `ؤ`, `ئ`) but also common substitutions like:
| Character | Unicode | Suggested Normalization |
|---|---|---|
| أ, إ, ء, آ | U+0623, U+0625, U+0621, U+0622 | Normalize to ا (U+0627) |
| ؤ | U+0624 | Normalize to و (U+0648) |
| ئ | U+0626 | Normalize to ي (U+064A) |
| ى | U+0649 | Normalize to ي (U+064A) |
| ة | U+0629 | Normalize to ه (U+0647) |
| ي vs ی | U+064A vs U+06CC | Normalize to ی (U+06CC) |
| ك vs ک | U+0643 vs U+06A9 | Normalize to ک (U+06A9) |
This issue is compounded when users omit diacritics or use different keyboard layouts, resulting in fragmented search behavior.
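To illustrate the diacritics part of the problem, here is a minimal sketch of stripping Arabic combining marks before comparison. The class and method names (`DiacriticStripper`, `stripMarks`) are made up for this post, and the chosen range (U+064B to U+065F, plus U+0670 and the tatweel U+0640) is our assumption about what a search layer would want to ignore, not something Discourse does today.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Discourse.
final class DiacriticStripper {
    // U+064B..U+065F: Arabic combining marks (fathatan .. sukun and later marks)
    // U+0670: superscript alef, U+0640: tatweel (kashida elongation)
    private static final Pattern MARKS =
            Pattern.compile("[\\u064B-\\u065F\\u0670\\u0640]");

    // Returns the text with all matched marks removed; other characters are untouched.
    static String stripMarks(String text) {
        return MARKS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // "Muhammad" written with damma, fatha and shadda
        String vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F";
        // The same word as it is usually typed, without marks
        String plain = "\u0645\u062D\u0645\u062F";
        System.out.println(stripMarks(vocalized).equals(plain)); // prints true
    }
}
```

With a step like this applied to both the stored text and the query, a vocalized post and an unvocalized query would compare equal.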
Proposed Solution
We recommend implementing a Unicode-aware normalization layer during both indexing and query parsing. This can be achieved by:
- Preprocessing both indexed content and user queries to unify character variants.
- Applying normalization rules similar to those used in Arabic and Persian NLP libraries or search engines (e.g., Farasa for Arabic, Hazm for Persian, or custom regex-based mappers).
- Optionally, supporting fuzzy matching or Levenshtein distance for near-exact matches.
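For the optional fuzzy-matching bullet, this is roughly what we have in mind: a plain Levenshtein distance, which would let a query tolerate a single missing space such as the `إطلاقمقامي` case above. The class name `EditDistance` is hypothetical, and this is only a sketch of the textbook algorithm, not a claim about how it would be wired into Discourse search.

```java
// Hypothetical helper for illustration only; standard two-row Levenshtein distance.
final class EditDistance {
    // Returns the minimum number of single-character insertions, deletions,
    // or substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String withSpace = "\u0625\u0637\u0644\u0627\u0642 \u0645\u0642\u0627\u0645\u064A";
        String noSpace = "\u0625\u0637\u0644\u0627\u0642\u0645\u0642\u0627\u0645\u064A";
        System.out.println(levenshtein(withSpace, noSpace)); // prints 1
    }
}
```

A small threshold (for example, distance 1 on already-normalized text) would catch the missing-space variant without matching unrelated words, though the right threshold is something we have not tested.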
Here’s a simplified example of a normalization function (Java-style):
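    // Folds Arabic/Persian letter variants to one canonical form, following the table above.
    // Order matters: ئ and ى are first mapped to ي, which the final ي rule then maps to ی.
    public static String normalizeArabic(String text) {
        return text.replace("أ", "ا")
                   .replace("إ", "ا")
                   .replace("آ", "ا")
                   .replace("ء", "ا")   // standalone hamza, per the table
                   .replace("ؤ", "و")
                   .replace("ئ", "ي")
                   .replace("ى", "ي")
                   .replace("ة", "ه")
                   .replace("ي", "ی")
                   .replace("ك", "ک");
    }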
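To show what "normalize on both sides" could look like, here is a slightly fuller sketch that uses a lookup table and a single pass, then compares the folded query against folded post text. The names `ArabicSearchKey` and `toSearchKey` are invented for this example, the mapping is simply the table from this post (with ئ and ى sent straight to ی, since one pass cannot chain rules), and code points are written as escapes so the file compiles regardless of source encoding. It does not handle the missing-space variant.

```java
import java.util.Map;

// Hypothetical helper for illustration only; the mapping mirrors the table in this post.
final class ArabicSearchKey {
    private static final Map<Character, Character> FOLD = Map.of(
            '\u0623', '\u0627', // alef with hamza above -> alef
            '\u0625', '\u0627', // alef with hamza below -> alef
            '\u0622', '\u0627', // alef with madda -> alef
            '\u0621', '\u0627', // standalone hamza -> alef
            '\u0624', '\u0648', // waw with hamza -> waw
            '\u0626', '\u06CC', // yeh with hamza -> Farsi yeh
            '\u0649', '\u06CC', // alef maksura -> Farsi yeh
            '\u064A', '\u06CC', // Arabic yeh -> Farsi yeh
            '\u0629', '\u0647', // teh marbuta -> heh
            '\u0643', '\u06A9'  // Arabic kaf -> keheh
    );

    // Folds each character independently, so rule order is irrelevant.
    static String toSearchKey(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char folded = FOLD.getOrDefault(c, c);
            out.append(folded);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Query typed with alef-hamza-below; post typed with a bare alef
        String query = "\u0625\u0637\u0644\u0627\u0642 \u0645\u0642\u0627\u0645\u064A";
        String post = "\u0627\u0637\u0644\u0627\u0642 \u0645\u0642\u0627\u0645\u064A";
        // The same fold is applied at index time (post) and at query time
        System.out.println(toSearchKey(query).equals(toSearchKey(post))); // prints true
    }
}
```

The key point is that the same function runs over stored content and over the search string, so the index never has to know which variant the user typed.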
Request
Could this normalization be considered for inclusion in the core search engine or as a plugin? It would significantly improve usability for Arabic and Persian communities using Discourse.
If there’s an existing workaround or plugin that addresses this, we’d appreciate any guidance.
Thank you for your time and for building such a powerful platform.
Best regards