Discourse's Search Implementation

(James Kiesel) #1

Hey Discourse devs,

I’m working on improving a Postgres full-text search and decided to have a peek through Discourse’s implementation, since our use cases are very similar.

Was wondering if any devs (or otherwise knowledgeable persons) might have some insight into the logic behind Discourse search (which, I have to admit, is much more logic-heavy than I was expecting it to be.)

Some specific questions:

  • Why the differentiation between a Post and its PostSearchData (i.e., why is putting the tsvector column straight onto the Post a bad idea?)
  • What’s the reasoning behind having separate SearchData classes for each searchable type (PostSearchData, CategorySearchData, etc.), instead of making it a polymorphic relationship (i.e., a searchable has_one search_data)? Would the resulting generic SearchData table be too massive to work with?
  • I see there’s a need to occasionally reindex the search (using the rebuild_problem_posts method); why is that? Does the search data go stale over time for some reason?

Any other specific things to be aware of while putting a search like this together?

Thanks so much!

(Dean Taylor) #2

I’m pretty sure rebuild_problem_posts deals with problems relating to when the locale changes and the need to re-index posts that use the previous locale.

I just remember lots of discussion regarding locale issues at the time, @sam would know that stuff.

EDIT: I’m not a project dev by the way. :wink:

(Erick Guan) #3

It reduces the index size, and the model is cleaner.

I don't think it would help much. We would still need a WHERE clause to select the corresponding data. (Discourse can search within a specific context.)

That said, I believe we should use concerns instead of observers, as the Rails team suggests. Maybe in the next refactor.

Let me give an example here. CJK text is not separated by spaces, so sentences have to be broken down into words by a separate program. Whenever we change that logic, we have to reindex.
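To make the CJK point concrete, here is a minimal Python sketch. It is not Discourse's code: the bigram splitting is a simple stand-in for a real word segmenter, and `INDEX_VERSION`, `segment`, and `stale_ids` are hypothetical names invented for illustration. The idea is that each search-data row remembers which version of the segmentation logic produced it, so rows built by older logic can be found and reindexed.

```python
import re

# Hypothetical version marker: bump it whenever the tokenizing logic changes.
INDEX_VERSION = 2

# Runs of CJK Unified Ideographs (a deliberately narrow range for this sketch).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def segment(text):
    """Split text into search tokens; CJK runs become overlapping bigrams."""
    tokens = []
    for chunk in text.split():
        pos = 0
        for match in CJK_RUN.finditer(chunk):
            # Keep any non-CJK text before the run as a single token.
            if match.start() > pos:
                tokens.append(chunk[pos:match.start()].lower())
            run = match.group()
            if len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
            pos = match.end()
        # Keep any non-CJK text after the last run.
        if pos < len(chunk):
            tokens.append(chunk[pos:].lower())
    return tokens


def stale_ids(rows, current_version=INDEX_VERSION):
    """Return ids of rows that were indexed by an older version of the logic."""
    return [row["id"] for row in rows if row["version"] < current_version]
```

With something like this, a periodic job could call `stale_ids` on the stored search data and re-run `segment` only for the rows it returns, rather than rebuilding the whole index after every change.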