Database full text search capabilities/limits?

Continuing the discussion from Search results should prioritize first post in topic when title matches search term:

This is way OT for the original discussion, but I’m not sure whether to be merely surprised, or full-on appalled:

(Emphasis mine)

I’m not about to dispute database choices, as I have no experience here, but this statement implies that Postgres does not differentiate internally between these cases. Is this kind of limit common among database text search engines? If so, that would go far to explain why most site-search facilities are total garbage…

Let me clarify this statement. For exact matches, there is no need to rank the results since there is either a match or there isn’t. For contain matches, how will we decide which should rank higher?

  1. This is my search term
  2. Search term is interesting
  3. Search term is more interesting here since the title is longer

All the above titles contains the search term so the question is how do rank the results? Currently we rely on Postgres’s ranking function to determine this for us.

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

There are a bunch of factors that Postgres takes into account but ultimately ranking is still somewhat vague and really depends from site to site depending on how the content of the forum is structured. Our aim here at Discourse is to provide a good default that would work for most of the forums out there.

5 Likes

Ahh… So, probably some misunderstanding on my part here. I thought that you meant that the database couldn’t differentiate between an exact match and a contains match.

In the context of prioritizing matches in topic titles, as we were originally discussing:

  • I’d actually call all three of your examples “Exact matches” for the purpose of my intent.
  • “There is a term you can Search” would be a “contains” match.

  • Your three examples are equal in ranking. Just list them in the order that they are found.
  • My example ranks below them
  • They all, though, rank before any posts that are matching/relevant only because of post content without the title.

If you’re concerned that top search results will get flooded with “Title” matches, only use the title search when the search term has reached a minimum complexity level. Target is at most 4/5 “Title” matches at the top of search results. (Though I see that would probably be difficult to figure out)