Now should I only do myself? No. We should both do ourselves

Apologies for the bad title, but I’m trying to make a point.

Every single word in my title (until someone changes it, and then for the record it was "Now should I only do myself? No. We should both do ourselves. ") is on the Postgres stop words list. This topic title is invisible to Postgres. I don’t mind stop words being used for indexing topic bodies, but it irks me a lot to have them used in topic titles. Stop words are sometimes the key words to distinguish one topic from another, so I’d like my searches to be for the terms I’ve put in.
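To make the effect concrete, here is a minimal sketch of stop word filtering applied to a title. It is not Postgres's actual parser (which also stems words); `STOP_WORDS` is a small hand-copied subset of the English stop word list, and `searchable_terms` is a hypothetical helper name used only for illustration.

```python
import re

# Assumption: a hand-copied subset of the Postgres English stop word list,
# containing only the words needed for this illustration.
STOP_WORDS = {
    "now", "should", "i", "only", "do", "myself",
    "no", "we", "both", "ourselves", "this", "or", "that",
}

def searchable_terms(title, stop_words=STOP_WORDS):
    """Return the lowercased words of `title` that survive stop word removal."""
    words = re.findall(r"[a-z']+", title.lower())
    return [w for w in words if w not in stop_words]

title = "Now should I only do myself? No. We should both do ourselves"
print(searchable_terms(title))                        # every word is filtered out
print(searchable_terms("This or That"))               # same problem
print(searchable_terms("Search ignores stop words"))  # these words are kept
```

With nothing left after filtering, there is nothing for a title search to match against.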

This is not a bug in Discourse; it is just using the defaults from Postgres. But it does cause user confusion when words get ignored.

Please, please, find a way around the stop list for topic titles.

Where has this been an issue in practice? I’m not a fan of completely artificial examples.

I have seen forum game topics titled “This or That”, which is another title made up entirely of stop words.

And recently I found a book in a book store which has a title that is on Amazon’s stop word list. That issue made me think of this problem again. (The book is titled The Book and is a history of books and related technologies, like paper making.)

Granted, there will be fewer topic titles than post content.

But there is a good reason for using dictionaries of stop words: performance.

Simply having no stop words would likely slow database full text searches to a crawl.

I think if this is a real issue for you, you should look into creating a custom dictionary for your site.

https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html

I suspect that is not true. For full text, maybe. For titles? That’s a much smaller working space. This forum has around 50k titles, if I’m correct in thinking that …ourselves/49417 shows me a sequential number. Most of those will have under a dozen words (mine is exactly a dozen). At a dozen words each, that is at most 600,000 words to index, which should not be an issue for a modern system. How many posts here will use the word “Discourse”? The results cap at 50, but I bet it is nearer to 50k than 50. (Google tells me “about 17,800” for “site:meta.discourse.org discourse” and I suspect it is an undercount for “posts”.) And yet Postgres has no problems searching for a word with that many results.
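As a rough sketch of what “index titles without a stop list” could mean, here is a tiny in-memory inverted index over titles that keeps every word. It is an illustration only, not how Discourse or Postgres index anything, and it makes no performance claim; the topic ids and the function names `build_title_index` and `search_titles` are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words, keeping stop words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_title_index(titles):
    """Map each word to the set of topic ids whose title contains it."""
    index = defaultdict(set)
    for topic_id, title in titles.items():
        for word in tokenize(title):
            index[word].add(topic_id)
    return index

def search_titles(index, query):
    """Return ids of topics whose title contains every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical topic ids, for illustration only.
titles = {
    1: "This or That",
    2: "Now should I only do myself? No. We should both do ourselves",
    3: "Search ignores stop words",
}
index = build_title_index(titles)
print(search_titles(index, "this or that"))
print(search_titles(index, "do ourselves"))

# The back-of-envelope estimate from the post: 50k titles, a dozen words each.
print(50_000 * 12)
```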

I guess your testing has proven that the Postgres documentation is in need of correction, i.e.:

Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance.

There is a link at the bottom of the page to a form where you should submit your findings so others can benefit from the improved knowledge.

https://www.postgresql.org/account/login/?next=/account/comments/new/9.1/textsearch-dictionaries.html/

It is possible; however, it raises the bar significantly for performance issues:

https://blog.codinghorror.com/stop-me-if-you-think-youve-seen-this-word-before/

Unlikely this is anything we would attack super soon. The answer is “use Google in these cases”. Have you tried that? Our 404 page includes a Google search box, as does the search help.

https://meta.discourse.org/search?q=Now%20should%20I%20only%20do%20myself%3F%20No.%20We%20should%20both%20do%20ourselves%20

It’s not on the zero results search page. And the first words of the search help are “Title matches are prioritized – when in doubt, search for titles”.

I’m asking for title matches to get real priority.

Feel free to submit pull requests. A Google search on the no search results page is a good idea.

I want to revisit this and add it to someone’s list because I like it a lot. @neil can you take it? Try to normalize the search code that’s on the “topic not found” page so we aren’t duplicating stuff everywhere, and bear in mind the public vs. private site caveats: we don’t want to show this on a private site…

I think the copy needs to be different here since it adds a second search field.

(and why is that New Topic button there?)

The idea was

Oh, I can’t find this, I should create a new topic since it doesn’t exist!

We should maybe move to a more conversational UI here like we do at the bottom of topics

Want to read more? Browse other topics in #feature or view latest topics.

Such as

Can’t find what you’re looking for? Start a new topic or use Google to search instead.

It would be nice if the domains to be searched by Google could be customized in site settings. That way, one could include a WordPress site, for example. Or was “search this site” already thought as including the entire domain, not just the forum’s sub-domain?
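For what it’s worth, a customizable domain list could be turned into a Google query with `site:` operators roughly like this. This is only a sketch: `google_site_search_url` is a hypothetical helper, not an existing Discourse function or site setting, and `blog.example.com` stands in for something like a WordPress site.

```python
from urllib.parse import urlencode

def google_site_search_url(query, domains):
    """Build a Google search URL restricted to the given domains."""
    site_filter = " OR ".join(f"site:{domain}" for domain in domains)
    if len(domains) > 1:
        # Group the alternatives so OR applies only to the site filters.
        site_filter = f"({site_filter})"
    q = f"{query} {site_filter}".strip()
    return "https://www.google.com/search?" + urlencode({"q": q})

# Forum only:
print(google_site_search_url("stop words", ["meta.discourse.org"]))
# Forum plus a hypothetical second site:
print(google_site_search_url("stop words", ["meta.discourse.org", "blog.example.com"]))
```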

@codinghorror How about this? (Putting ember link-to’s inside translated strings is awkward)

Looks good but delete the header above the search box.

Why remove the header? It is currently on the search page with or without results found, seems weird to remove it.

Existing no results page:

Existing results found page:

I am referring to the header containing the words “Search this site”, in the prior screenshot. To be crystal clear:

Also minor point but @neil I think this reads better as

Can’t find what you’re looking for? Start a new topic, or search with Google instead:

edit: I see the problem. Search with Google may not be available (private site) and creating a topic may not be available (you don’t have permissions), so these have to be two complete, independent sentences. OK, fine as is then @neil!
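The constraint can be sketched as follows: each sentence is shown or hidden on its own, so neither may depend on the other being present. The function name, parameters, and exact wording below are assumptions for illustration, not the actual Discourse code or copy.

```python
def no_results_prompts(site_is_public, can_create_topic):
    """Return the independent sentences to show under an empty search result."""
    prompts = []
    if can_create_topic:
        # Only offered when the user has permission to create topics.
        prompts.append("Can't find what you're looking for? Start a new topic.")
    if site_is_public:
        # Google cannot index a private site, so hide this sentence there.
        prompts.append("Try searching this site with Google.")
    return prompts

for public in (True, False):
    for can_create in (True, False):
        print(public, can_create, no_results_prompts(public, can_create))
```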

Also @elijah here’s the actual list of stopwords:

https://github.com/postgres/postgres/blob/master/src/backend/snowball/stopwords/english.stop

So yeah the title is literally all stopwords from this list, :clap:
