Search for wb.camra.org.uk doesn't return any results


(Rob Nicholson) #1

Can anyone hazard a guess why searching for “wb.camra.org.uk” doesn’t find any posts when that text is most certainly in a post, whereas a search for “wb.camra” does work?

image

Kind of proving the text does exist as I asked the same question in our own support category:

image


(Jeff Atwood) #2

My guess is “camra” is the only string long enough to get past the minimum search limits. I suspect “camra” is all you are really searching for there.
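To make that guess concrete, here is a minimal sketch of a query being broken into terms, with short terms discarded. Everything in it is an assumption for illustration: the `surviving_terms` helper, the split on non-alphanumeric characters, and the threshold of 4 are hypothetical and are not taken from Discourse's code or settings.

```ruby
# Hypothetical minimum term length; an assumption for illustration only.
MIN_TERM_LENGTH = 4

# Split on anything that is not a letter or digit, then drop short terms.
def surviving_terms(query, min_length = MIN_TERM_LENGTH)
  query.downcase.split(/[^a-z0-9]+/).select { |term| term.length >= min_length }
end

p surviving_terms("wb.camra.org.uk")  # ["camra"]
p surviving_terms("wb.camra")         # ["camra"]
```

Note that under this model both queries reduce to the same single term, so on its own it would predict identical results for the two searches.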


(Eli the Bearded) #3

So why does adding more break the search? Now that this post has been made, it’s reproducible here on Meta.


(Jeff Atwood) #4

Remember that in normal human being talk, periods end sentences.

So you definitely wouldn’t want to search for just “org” or “wb” or “uk”.

Long unique strings is where it’s at, from a searching perspective. Searching for “to” or “by” ain’t not never no good no how.


(Eli the Bearded) #5

But a search for “wb org uk” does find the first post.


(Rob Nicholson) #6

I would say searching for website addresses is a pretty common requirement as it’s a unique identifier. And yes, it is now repeatable on here. I think people would complain if searching for www.bbc.co.uk didn’t work on Google.

Maybe somebody who knows how the search engine works could comment on what Discourse is doing with the search string? Is it getting processed so that it’s not searching on what one types?

Later… and yes, having included www.bbc.co.uk in this message, Discourse is unable to find it.


(Jeff Atwood) #7

Searching for camra.org seems to work OK, as does camra.

As to what bits of Postgres innards don’t like camra.org.uk, wb.camra.org and wb.camra.org.uk, I can’t say. There is a reason there are tens of thousands of people working at Google, and literally nobody uses Bing even though it air-quotes “works”.


(Dean Taylor) #8

Although this is not exactly the same issue - something that might be relevant is this:

@sam did put a fix in for something similar to this sort of thing back in July 2016:

Where the last word wasn’t searchable.

Although the file app/models/search_observer.rb doesn’t seem to exist in the current version of the code, I assume those fixes are still in there somewhere…

Because the tests still exist for it:

But as I said before - perhaps a slightly different issue.


EDIT: found the current relevant code:


(Sam Saffron) #9

I don’t really buy the “Google is awesome at search, therefore no point improving search cause Google is awesome at search” argument :stuck_out_tongue_winking_eye:


(Jeff Atwood) #10

There is absolutely a point in improving, but we may need to hire 10,000 folks to reach parity with Google’s efforts…


(Sam Saffron) #11

Sure, but… I can fix this specific bug


(Jeff Atwood) #13

Is this a bug in your code, then? Is that what you’re saying?


(Sam Saffron) #14

The pg tokenizer is dumb. I already have workarounds for some edge cases, and this is another example.


(Jeff Atwood) #15

Maybe there will be some improvements when we upgrade to PG 10 later this year.


(Sam Saffron) #16

I have not done root cause analysis here; this could very well be a bug in my code for all I know.


(Sam Saffron) #17

It looks like it tokenizes right, at least on my local install and meta.

[5] pry(main)> Post.exec_sql("SELECT * FROM ts_debug('english', 'wb.camra.org.uk');").to_a
=> [{"alias"=>"host",
  "description"=>"Host",
  "token"=>"wb.camra.org.uk",
  "dictionaries"=>"{simple}",
  "dictionary"=>"simple",
  "lexemes"=>"{wb.camra.org.uk}"}]

So this is likely an implementation bug on our side.
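As a rough illustration of how a correct tokenizer can still end up with a failing search, here is a toy model in which the index and the query are simply tokenized differently. It is a sketch under stated assumptions, not the actual cause or the actual code: `index_lexemes`, `query_lexemes` and `match?` are hypothetical helpers, and the dot-splitting index is an assumption chosen only to reproduce two observations from this topic (a search for “wb org uk” matches, a search for “wb.camra.org.uk” does not).

```ruby
require "set"

# Toy model only; not Discourse's or PostgreSQL's actual code.

# Hypothetical index side: splits post text on whitespace AND dots.
def index_lexemes(text)
  text.downcase.split(/[\s.]+/).reject(&:empty?).to_set
end

# Query side: whitespace-separated terms are kept whole, the way ts_debug
# reports 'wb.camra.org.uk' as a single "host" token.
def query_lexemes(query)
  query.downcase.split.to_set
end

# A post matches when every query lexeme is present in its index entry.
def match?(indexed, wanted)
  wanted.subset?(indexed)
end

INDEXED = index_lexemes("Search for wb.camra.org.uk doesn't return any results")

puts match?(INDEXED, query_lexemes("wb org uk"))        # true
puts match?(INDEXED, query_lexemes("wb.camra.org.uk"))  # false
```

The model does not try to account for every case mentioned earlier in the topic; it only shows that a mismatch between how text is indexed and how a query is parsed is enough to hide a post whose text plainly contains the search string.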


(Sam Saffron) #18

Fixed per:

Thanks @DeanMarkTaylor for finding the exact reason for the bug!

Note that you will need to re-index any posts affected by this issue, which means you have to edit them (I am not 100% sure whether a rebake will catch this).

search for: wb.camra.org.uk :mag:


(Sam Saffron) #19

This topic was automatically closed after 25 hours. New replies are no longer allowed.