Optimizing Discourse search for CJK languages

:bookmark: This guide explains how to adjust Discourse settings to better accommodate Chinese, Japanese, and Korean (CJK) languages in site search.

:person_raising_hand: Required user level: Administrator

Discourse’s default settings may not be optimal for CJK languages. This guide walks you through the adjustments that improve the experience for CJK users.

Summary

We’ll cover the following adjustments:

  1. Lowering the minimum search term length
  2. Allowing uppercase posts
  3. Adjusting minimum post and topic title lengths
  4. Setting appropriate entropy values
  5. Adjusting minimum title and body similarity lengths
  6. Troubleshooting search issues

Adjusting site settings

To make these changes, navigate to your site’s admin panel and search for the following settings:

1. Minimum search term length

Set min_search_term_length to 1 or 2.

CJK keywords are often only one or two characters long, so lower this value to make short terms searchable.

2. Allow uppercase posts

Enable the allow_uppercase_posts setting.

Discourse may not recognize CJK characters correctly when analyzing titles and posts. Enabling this setting prevents users from encountering errors when they create topics.

3. Minimum post length

Set min_post_length to approximately 8.

This value provides a reasonable minimum length for sentences in CJK languages.

4. Body minimum entropy

Set body_min_entropy to about half of the min_post_length value (for example, 4 when min_post_length is 8).

Reduplication (repeating a character or word) is common in CJK languages, and the repeated characters carry meaning. Setting this value too high can cause users to see “not meaningful post” errors.
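
To see why repeated characters can trip the entropy check, here is a minimal Python sketch. It assumes entropy can be approximated as the number of distinct non-whitespace characters in a post. That is a simplification for illustration only, and Discourse's actual calculation may differ. The function names and sample sentences are hypothetical.

```python
def distinct_characters(text):
    """Count distinct non-whitespace characters (a rough stand-in for entropy)."""
    return len({ch for ch in text if not ch.isspace()})


def passes_entropy_check(text, min_entropy):
    """Return True when the text has at least min_entropy distinct characters."""
    return distinct_characters(text) >= min_entropy


# Hypothetical sample posts: the first is pure reduplication.
samples = ["哈哈哈哈哈哈哈哈", "谢谢谢谢，太好了！", "今天天气很好，谢谢"]

for sample in samples:
    score = distinct_characters(sample)
    verdict = "ok" if passes_entropy_check(sample, 4) else "rejected"
    print(f"{sample}: {score} distinct characters, {verdict}")
```

Under this approximation, an eight-character post made of a single repeated character meets a min_post_length of 8 but scores only 1, so it would be rejected by any entropy threshold above 1.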

5. Minimum topic title length and entropy

Adjust min_topic_title_length and title_min_entropy similarly to the post length and body entropy settings.

6. Minimum title and body similarity length

Set min_title_similar_length and min_body_similar_length according to the values you’ve assigned in the previous steps.
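
The relationships between these settings can be sketched as a small Python helper. This is only an illustration under stated assumptions: the helper name is hypothetical, "about half" is taken as integer division with a floor of 1, and the similarity lengths are assumed to mirror the minimum lengths, which is one reading of "according to the values you've assigned". The title length of 6 is an example input, not a recommendation from this guide.

```python
def suggested_cjk_settings(min_post_length, min_topic_title_length):
    """Derive related setting values from the two minimum lengths (illustrative)."""

    def half(n):
        # "About half", rounded down, but never below 1.
        return max(1, n // 2)

    return {
        "min_post_length": min_post_length,
        "body_min_entropy": half(min_post_length),
        "min_topic_title_length": min_topic_title_length,
        "title_min_entropy": half(min_topic_title_length),
        # Assumption: similarity lengths mirror the minimum lengths.
        "min_title_similar_length": min_topic_title_length,
        "min_body_similar_length": min_post_length,
    }


# 8 comes from the guide; 6 is a hypothetical title length.
settings = suggested_cjk_settings(min_post_length=8, min_topic_title_length=6)
for name, value in settings.items():
    print(f"{name} = {value}")
```

The printed values still have to be entered by hand in the admin panel; the helper only keeps them consistent with one another.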

Troubleshooting search issues

If search still misbehaves after making these changes, you may need to rebuild the search index. Here’s how to do it:

  1. Enter your Discourse Docker installation directory (typically /var/discourse on a standard install).
  2. Run the following command to access the app container:
    ./launcher enter app
    
  3. Once inside the container, run the reindexing command:
    rake search:reindex
    

After reindexing, you should be able to search content effectively.

Last edited by @hugh 2024-07-26T01:02:02Z

Last checked by @hugh 2024-07-26T01:02:07Z


We already do this (by default) now as per:

https://github.com/discourse/discourse_docker/blob/762d9bbf6827d25295923b3ff0145d80008f0d41/templates/postgres.9.5.template.yml#L151

@sam is this topic still relevant?


Encoding for PostgreSQL is perfectly fine now. Settings 1 and 2 remain essential for CJK users, as do the settings that adjust the post length and title length restrictions.

Reindexing is still a useful troubleshooting trick, but it is rarely needed.


I wiki’d it, so feel free to remove the stuff that’s redundant now.
