main ← cldr-search-aliases
opened 09:20PM - 22 Apr 26 UTC
## Summary
Layers Unicode CLDR annotations on top of the Telegram search aliases (added in #12). CLDR is authoritative, maintained by the Unicode Consortium, and covers many more locales, including ones Telegram doesn't (bg, bs_BA, el, en_GB, et, hy, lt, lv, pt European, sl, sq, sw, te, th, ur, vi).
Each source now lives in its own directory, and a new rake task unions them into the final path Discourse consumes:
- `dist/telegram_search_aliases/<locale>.json` — from `emojis:telegram:import`
- `dist/cldr_search_aliases/<locale>.json` — from `emojis:cldr:import`
- `dist/locale_search_aliases/<locale>.json` — merged output, produced by `emojis:search_aliases:merge`
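A minimal sketch of the union step, not the actual rake task. It assumes each per-locale JSON file maps an emoji name to an array of keywords (the real file shape may differ), and `merge_aliases` is a hypothetical helper name. Telegram keywords keep their position, CLDR-only keywords are appended, and duplicates are dropped:

```ruby
require "json"

# Hypothetical helper. Assumes both inputs have the shape
# { "emoji_name" => ["keyword", ...] }, which is an assumption about the files.
def merge_aliases(telegram, cldr)
  (telegram.keys | cldr.keys).sort.to_h do |emoji|
    # Telegram keywords first, then CLDR additions; uniq drops the overlap.
    [emoji, (telegram.fetch(emoji, []) + cldr.fetch(emoji, [])).uniq]
  end
end

# Made-up sample data, for illustration only.
telegram = JSON.parse('{"smile": ["sorriso", "feliz"]}')
cldr     = JSON.parse('{"smile": ["feliz", "rosto"], "cat": ["gato"]}')

merged = merge_aliases(telegram, cldr)
puts JSON.pretty_generate(merged)
```

An emoji that appears in only one source keeps that source's keywords unchanged, which is how a locale with no Telegram data would still end up with full CLDR coverage under this sketch.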
## Locale mapping
CLDR uses locale inheritance: child files contain only deltas over a base. An explicit table in the rake task maps each Discourse locale to an ordered list of CLDR sources to union. The one case upstream flagged specifically:
- Discourse `pt_BR` ← CLDR `[pt]` (pt.xml is Brazilian Portuguese in modern CLDR)
- Discourse `pt` ← CLDR `[pt, pt_PT]` (base plus European overrides)
The same pattern handles `en_GB`, `zh_TW` (← `zh` + `zh_Hant`), and others.
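The mapping can be pictured roughly as follows. This is a sketch under stated assumptions rather than the table in the rake task: `CLDR_SOURCES`, `sources_for`, and `union_sources` are hypothetical names, only the three entries named above are taken from this description, and the fallback to a same-named CLDR file is an assumption. It also treats the ordered sources as a plain keyword union, as described above; the real task may handle child overrides differently.

```ruby
# Hypothetical table; only these three entries come from the PR description.
CLDR_SOURCES = {
  "pt_BR" => %w[pt],
  "pt"    => %w[pt pt_PT],
  "zh_TW" => %w[zh zh_Hant],
}.freeze

# Assumption: locales without an explicit entry use the CLDR file of the same name.
def sources_for(locale)
  CLDR_SOURCES.fetch(locale) { [locale] }
end

# Unions keywords per emoji across the ordered sources, base first.
# `annotations` is assumed to be { cldr_locale => { emoji => [keywords] } }.
def union_sources(locale, annotations)
  sources_for(locale).each_with_object({}) do |source, result|
    annotations.fetch(source, {}).each do |emoji, keywords|
      result[emoji] = result.fetch(emoji, []) | keywords
    end
  end
end

# Made-up sample annotations, for illustration only.
annotations = {
  "pt"    => { "train" => %w[trem ferrovia] },
  "pt_PT" => { "train" => %w[comboio ferrovia] },
}

pt_result    = union_sources("pt", annotations)
pt_br_result = union_sources("pt_BR", annotations)
```

With this sample data, `pt` picks up the European keyword on top of the base list, while `pt_BR` sees only the base `pt` file.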
## Impact
16 previously unsupported locales gained coverage; 32 existing locales were enriched with CLDR-only keywords. Across all locales: **41,284 → 87,831 emoji-alias entries**.
| locale | before | after | delta |
|----------|-------:|------:|--------:|
| ar | 1834 | 1847 | +13 |
| be | 428 | 1812 | +1384 |
| **bg** | 0 | 1804 | +1804 |
| **bs_BA**| 0 | 1802 | +1802 |
| ca | 1862 | 1879 | +17 |
| cs | 1784 | 1877 | +93 |
| da | 44 | 1803 | +1759 |
| de | 1750 | 1856 | +106 |
| **el** | 0 | 1804 | +1804 |
| en | 1735 | 1837 | +102 |
| **en_GB**| 0 | 1804 | +1804 |
| es | 1867 | 1878 | +11 |
| **et** | 0 | 1803 | +1803 |
| fa_IR | 1719 | 1861 | +142 |
| fi | 1829 | 1848 | +19 |
| fr | 1792 | 1843 | +51 |
| gl | 34 | 1804 | +1770 |
| he | 1779 | 1859 | +80 |
| hr | 1823 | 1848 | +25 |
| hu | 1750 | 1848 | +98 |
| **hy** | 0 | 1804 | +1804 |
| id | 1006 | 1822 | +816 |
| it | 1855 | 1876 | +21 |
| ja | 651 | 1807 | +1156 |
| ko | 97 | 1804 | +1707 |
| **lt** | 0 | 1804 | +1804 |
| **lv** | 0 | 1804 | +1804 |
| nb_NO | 1857 | 1879 | +22 |
| nl | 1151 | 1834 | +683 |
| pl_PL | 1858 | 1877 | +19 |
| **pt** | 0 | 1804 | +1804 |
| pt_BR | 1774 | 1863 | +89 |
| ro | 73 | 1801 | +1728 |
| ru | 1869 | 1879 | +10 |
| sk | 352 | 1813 | +1461 |
| **sl** | 0 | 1800 | +1800 |
| **sq** | 0 | 1802 | +1802 |
| sr | 1844 | 1860 | +16 |
| sv | 1651 | 1871 | +220 |
| **sw** | 0 | 1803 | +1803 |
| **te** | 0 | 1804 | +1804 |
| **th** | 0 | 1804 | +1804 |
| tr_TR | 1333 | 1831 | +498 |
| uk | 1670 | 1852 | +182 |
| **ur** | 0 | 1804 | +1804 |
| **vi** | 0 | 1803 | +1803 |
| zh_CN | 105 | 1804 | +1699 |
| zh_TW | 108 | 1805 | +1697 |
Bold rows are locales that had zero coverage before this PR.
## Workflow
```
rake emojis:telegram:import[all] # refreshes dist/telegram_search_aliases/
rake emojis:cldr:import[all] # refreshes dist/cldr_search_aliases/
rake emojis:search_aliases:merge # writes dist/locale_search_aliases/
```
Existing Telegram data was moved via `git mv` so history for the per-locale Telegram files stays intact.
## Test plan
- [x] `bin/lint` clean on all new/changed rake files
- [x] Ran `emojis:cldr:import[all]` end-to-end (48 locales fetched)
- [x] Ran `emojis:search_aliases:merge` — all 48 locales written
- [x] Spot-checked `pt_BR`: Telegram keywords preserved, CLDR additions appended, overlapping keywords deduplicated
- [x] Spot-checked `pt` (European): base `pt` keywords plus `pt_PT` overrides (1361 emojis with divergent aliases vs `pt_BR`)
- [ ] Verify the Discourse emoji picker search picks up new keywords once this package is bumped