Discourse needs better search

A possible approach, @Falco, could be the reverse of what our current one does.

For each topic try to extract/create 20 or so keywords and layer them on top of the existing keywords

I wonder if that helps
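To make the idea a bit more concrete, here is a rough sketch of what per-topic keyword extraction could look like. It is not how Discourse indexes anything today: the function names, the tiny stopword list, and the use of plain TF-IDF are all assumptions for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
             "for", "on", "with", "this", "that"}


def tokenize(text):
    """Lowercase the text, split it into alphanumeric words, drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def extract_keywords(topics, limit=20):
    """Return {topic_id: [keywords]} ranked by a plain TF-IDF score.

    `topics` maps a topic id to the full text of that topic.
    """
    tokenized = {tid: tokenize(text) for tid, text in topics.items()}

    # Document frequency: in how many topics does each word appear?
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    n_docs = len(topics)
    result = {}
    for tid, words in tokenized.items():
        term_freq = Counter(words)
        scores = {
            # Smoothed IDF so words present in every topic still score above zero.
            w: count * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
            for w, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[tid] = ranked[:limit]
    return result
```

The resulting keywords could then be stored next to the existing search data for each topic; how they would be weighted against the current index is an open question.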

2 Likes

Our relevance search does not take views or PageRank into account. To complicate things, all-time view counts can get really high and skew results, so we would probably need views per year or something similar to correct for that.

But… by accounting for PageRank, view counts, and likes, it is possible we could come up with a far better relevance algorithm.
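As a very rough sketch of the kind of thing I mean (every name and weight here is made up, and this is not what the current search does): take the existing text-match rank and scale it by log-damped popularity signals, using views from a recent window rather than all-time views.

```python
import math


def boosted_score(text_rank, views_last_year, likes,
                  view_weight=0.1, like_weight=0.2):
    """Scale a text-match rank by log-damped popularity signals.

    All parameters are hypothetical:
      text_rank        relevance from the existing full-text search (>= 0)
      views_last_year  views within a recent window, not all-time views
      likes            total likes on the topic
    """
    # log1p keeps huge counts from dominating: going from 10 to 100 views
    # matters about as much as going from 1,000 to 10,000.
    popularity = (view_weight * math.log1p(views_last_year)
                  + like_weight * math.log1p(likes))
    # Multiplying means a topic with no text match (rank 0) never surfaces
    # on popularity alone.
    return text_rank * (1.0 + popularity)
```

Picking the weights is the hard part; they would need tuning against real queries.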

This is complex stuff: a multi-trillion-dollar company was built on these ideas, and another multi-trillion-dollar company has no easy way of catching up.

8 Likes

There I fixed it … at #1 now.

I discussed this issue with @tgxworld and @JammyDodger in the past; we baked ourselves a very bad cake here.

The simple workaround is

Going through every single plugin topic and appending “Plugin” at the end.

Discourse Advertising Plugin
Discourse Chat Plugin
and so on…

Title matches “win” so for example

  • Advertising in category plugin will lose to Discourse Advertising Plugin question in category random.

We could “bloat” our title index by appending category and tags. I think this is what Google does anyway.

So instead of indexing:

first priority “Discourse Advertising”
second “plugin”
third priority “content”

We could index

first priority “Discourse Advertising - plugin tag1 tag2”
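Here is a toy model of why that would help. It is only a sketch: the three weights, the field names, and the word-overlap scoring are assumptions for illustration and are much cruder than the real ranking.

```python
# Hypothetical weights for the three priority levels described above.
WEIGHTS = {"first": 1.0, "second": 0.4, "third": 0.1}


def build_fields(topic, bloat_title):
    """Split a topic into prioritised index fields.

    With bloat_title=True the category and tags are appended to the
    first-priority (title) field. Keeping the category in the second
    field as well is an assumption of this sketch.
    """
    first = topic["title"]
    if bloat_title:
        first = " ".join([topic["title"], topic["category"], *topic["tags"]])
    return {"first": first, "second": topic["category"], "third": topic["content"]}


def score(fields, query):
    """Sum, per field, the field weight times the number of query terms it contains."""
    terms = query.lower().split()
    total = 0.0
    for name, text in fields.items():
        words = set(text.lower().split())
        total += WEIGHTS[name] * sum(1 for term in terms if term in words)
    return total


official = {"title": "Discourse Advertising", "category": "plugin",
            "tags": ["ads"], "content": "Show ads on your forum"}
question = {"title": "Discourse Advertising Plugin question", "category": "random",
            "tags": [], "content": "How do I set it up"}

for bloat in (False, True):
    ranked = sorted(
        [official, question],
        key=lambda t: score(build_fields(t, bloat), "advertising plugin"),
        reverse=True,
    )
    print("bloat_title =", bloat, "->", [t["title"] for t in ranked])
```

Without the bloated title the question in category random ranks first, because both query words hit its title; with category and tags folded into the title field, the official topic comes out on top.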

Of course a workaround is searching for:

#plugin chat

vs


FWIW … might as well go and fix up all the official plugins now, will only take me a few mins.


4 Likes

How about taking into account the number of links to the topic?

1 Like

Yes, that is PageRank; I mentioned that.

So many trade-offs though: should an exact title match lose to a high PageRank?
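For anyone who has not looked at it before, this is roughly what a PageRank-style score over topic-to-topic links looks like. It is a minimal textbook power iteration, not anything Discourse ships; the `links` structure and function name are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    `links` maps each topic id to the list of topic ids it links to.
    Returns {topic_id: rank}; the ranks sum to 1.
    """
    # Include topics that are only ever linked to, never linking out.
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every topic gets a small baseline share regardless of links.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this topic's rank evenly across the topics it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A topic with no outgoing links spreads its rank over everyone.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

How to blend this with the text match is exactly the trade-off above; one option would be to let an exact title match always win and only use the link score to order the rest.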

1 Like

No. Exact titles are what I most often look for, but I’m pretty special. When I’m looking for a “why didn’t you do a search” link, I’m mostly looking for things I know exist (a step away from standard install; for many months I was stumped that “straightforward” would no longer find the Configure direct-delivery incoming email for self-hosted sites with Mail-Receiver topic, but I recently got it renamed so “mail receiver” works).

Ah. Now I see that you said that.

For the things I actually search for that I don’t know that I’m looking for, the most-recent usually does best.

FWIW, on my own (largely just for me) sites, with relatively few topics and posts, I think search works pretty well!

3 Likes

This is the way; there are many search tools to test before wasting too much effort on the internal one. I don’t know of any site with an internal search that doesn’t get this complaint. Even Reddit, which is one of the largest sites around, gets criticized for its search.

1 Like

By correlating user behavior during searches and reading (and possibly through inquiries, as Google Maps does, for example), Discourse could internally generate knowledge about anticipated outcomes of queries.

I also wonder if AI could help steer a conversation towards the desired results. Such a dialog could start with a button that says: “I am dissatisfied with the results”. The role of the AI would then be to ask questions whose answers either narrow down the range of outcomes or prioritize them appropriately.

A Typesense plugin sounds amazing.

Good topic! Search in forums is a really tricky thing, and the solution of using Google tends to come up a bit too often for my tastes.

Would agree here. You don’t want old topics to dominate your search results.
Judging from my own search expectations, I would want the best results to be threads that are both recent and active, and which are a good match in terms of title and category. And even after that I would prefer recency to have a notable impact, because I often search for things that I vaguely remember.

Unfortunately also true. Personally, I’m not even sure how much links would really contribute to relevance (though they probably would be a factor), because in the forums I’m active in, but which are not support or technical forums of some kind, linking is relatively rare.
So I tend to consider recency and activity (i.e. the number of views, likes/reactions, and replies within the not-too-distant past) more important. I’m not sure whether this is already factored into the current search implementation.

2 Likes

I think it’s worth looking at the algorithm Reddit uses for its “hot” score:

math - Where do mathematical algorithms for Reddit’s ranking, as an example, come from? - Stack Overflow

That is something like

image
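For reference, here is a Python sketch of the “hot” formula as it is usually quoted from Reddit’s open-sourced code. I can’t promise it matches the image exactly, and the function and constant names are mine.

```python
from datetime import datetime, timezone
from math import log10

# Fixed offset (seconds since the Unix epoch) used as the zero point for age.
REDDIT_EPOCH = 1134028003


def hot(ups, downs, created):
    """Hot score: log-scaled net votes plus a bonus that grows with newness.

    `created` must be a timezone-aware datetime.
    """
    score = ups - downs
    # The first 10 net votes count as much as the next 90, then the next 900...
    order = log10(max(abs(score), 1))
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = created.timestamp() - REDDIT_EPOCH
    # Every 45000 seconds (12.5 hours) of newness is worth one order of
    # magnitude of net votes.
    return round(sign * order + seconds / 45000, 7)
```

The interesting property for search is that age never decays an old score; newer posts simply start from a higher baseline, so a post 12.5 hours newer ties with an older one that has ten times the net votes.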

1 Like