Search for wb.camra.org.uk doesn't return any results

Can anyone hazard a guess why searching for “wb.camra.org.uk” doesn’t find any posts when that text is most certainly in a post? A search for “wb.camra”, on the other hand, does work.

The text does exist in a post, as I asked the same question in our own support category.

My guess is “camra” is the only string long enough to get past the minimum search limits. I suspect “camra” is all you are really searching for there.

So why does adding more break the search? Now that this post has been made, it’s reproducible here on Meta.

Remember that in normal human being talk, periods end sentences.

So you definitely wouldn’t want to search for just “org” or “wb” or “uk”.

Long unique strings is where it’s at, from a searching perspective. Searching for “to” or “by” ain’t not never no good no how.

But a search for “wb org uk” does find the first post.

I would say searching for website addresses is a pretty common requirement as it’s a unique identifier. And yes, it is now repeatable on here. I think people would complain if searching for www.bbc.co.uk didn’t work on Google.

Maybe somebody who knows how the search engine works could comment on what Discourse is doing with the search string? Is it getting processed so that it’s not searching on what one types?
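To make the question concrete, here is a toy sketch of the general failure mode being asked about: the indexed text and the typed query are processed differently, so they never line up. Both tokenizers below are hypothetical and are not how Discourse or Postgres actually work. The sketch also does not reproduce the exact pattern seen in this topic (for example, “wb.camra” matching); it only shows how a mismatch of this kind can hide a string that is plainly in the post.

```ruby
require "set"

# Hypothetical index-time tokenizer: lowercases the text and splits on
# anything that is not a letter or digit, so a host name is broken apart.
def index_tokens(text)
  text.downcase.split(/[^a-z0-9]+/).reject(&:empty?).to_set
end

# Hypothetical query-time tokenizer: splits on whitespace only, so a
# dotted host name stays together as a single token.
def query_tokens(query)
  query.downcase.split(/\s+/).reject(&:empty?)
end

# A post matches when every query token is present in its index.
def match?(post_text, query)
  indexed = index_tokens(post_text)
  tokens = query_tokens(query)
  !tokens.empty? && tokens.all? { |t| indexed.include?(t) }
end

post = "See wb.camra.org.uk for details"
puts match?(post, "wb org uk")        # separate words are all in the index
puts match?(post, "wb.camra.org.uk")  # the single dotted token is not
```

In this toy model the fix would be to run the same tokenizer on both sides; whether that is what is going wrong here is exactly the open question.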

Later… and yes, having included www.bbc.co.uk in this message, Discourse is unable to find it.

Searching for camra.org seems to work OK, as does camra.

As to which bits of Postgres innards don’t like camra.org.uk, wb.camra.org, and wb.camra.org.uk, I can’t say. There is a reason there are tens of thousands of people working at Google, and literally nobody uses Bing even though it air-quotes “works”.

Although this is not exactly the same issue, something that might be relevant: @sam put in a fix for something similar back in July 2016, where the last word wasn’t searchable:
https://github.com/discourse/discourse/commit/12ecf8624a97c3d33aaa3cd56851b5f1ca347c90#diff-995f4013dde76fb0460ecdb36fa20d53

The file app/models/search_observer.rb doesn’t seem to exist in the current version of the code, but I assume those fixes are still in there somewhere, because the tests for them still exist:
https://github.com/discourse/discourse/blob/master/spec/components/search_spec.rb#L701-L704

But as I said before, this is perhaps a slightly different issue.


EDIT: found the current relevant code:
https://github.com/discourse/discourse/blob/master/app/services/search_indexer.rb#L23-L31

I don’t really buy the “Google is awesome at search, therefore no point improving search cause Google is awesome at search” argument :stuck_out_tongue_winking_eye:

There is absolutely a point in improving, but we may need to hire 10,000 folks to reach parity with Google’s efforts…

Sure, but… I can fix this specific bug

Is this a bug in your code, then? Is that what you’re saying?

The pg tokenizer is dumb. I already have workarounds for some edge cases, and this is another example.

Maybe there will be some improvements when we upgrade to PG 10 later this year.

I have not done a root cause analysis here; this could very much be a bug in my code for all I know.

It looks like it tokenizes right, at least on my local install and meta.

[5] pry(main)> Post.exec_sql("SELECT * FROM ts_debug('english', 'wb.camra.org.uk');").to_a
=> [{"alias"=>"host",
  "description"=>"Host",
  "token"=>"wb.camra.org.uk",
  "dictionaries"=>"{simple}",
  "dictionary"=>"simple",
  "lexemes"=>"{wb.camra.org.uk}"}]

So this is likely an implementation bug on our side.
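A possible next diagnostic step, sketched under assumptions: since `ts_debug` only shows how the text is tokenized, it can help to also check whether the indexed form and the query form actually match. The helper names below (`sql_literal`, `tsearch_check_sql`) are made up for illustration and are not part of Discourse. The Postgres functions `to_tsvector` and `plainto_tsquery` and the `@@` match operator are standard, but how Discourse builds its own query may well differ, so this only tests the plain Postgres path.

```ruby
# Doubles single quotes so the text can sit inside a SQL string literal.
# Assumes standard_conforming_strings is on (the Postgres default).
def sql_literal(text)
  "'" + text.gsub("'", "''") + "'"
end

# Builds a query that reports how the term is stored, how it is parsed
# as a query, and whether the two match each other.
def tsearch_check_sql(term, config = "english")
  cfg = sql_literal(config)
  lit = sql_literal(term)
  "SELECT to_tsvector(#{cfg}, #{lit}) AS indexed, " \
    "plainto_tsquery(#{cfg}, #{lit}) AS query, " \
    "to_tsvector(#{cfg}, #{lit}) @@ plainto_tsquery(#{cfg}, #{lit}) AS matches;"
end

puts tsearch_check_sql("wb.camra.org.uk")
```

The printed statement can be pasted into psql or run from a console. If `matches` comes back true for the plain Postgres path while the site search still finds nothing, that would point further toward the application side.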

Fixed per:

https://github.com/discourse/discourse/commit/6002f2ca4ac0657c8129b492d997d13889bdda12

Thanks @DeanMarkTaylor for finding the exact reason for the bug!

Note that you will need to re-index posts affected by this issue, which means you have to edit them (I am not 100% sure whether a rebake will catch this).

search for: wb.camra.org.uk :mag:

This topic was automatically closed after 25 hours. New replies are no longer allowed.