Effectively logging search queries

For 1.9 we plan to add a search log.

Unfortunately, the “log every search the server makes” approach is incorrect as we perform searches as people are typing which will result in a massively noisy log.

Proposal

  • Create a new table

term, user_id (nullable), ip_address, created_at, clicked_topic_id (nullable), source_id (either header or fullpage)

  • Log on server with the following algorithm on search
UPDATE term
SET term = :new_term
        created_at: :now
WHERE created_at < 5.seconds.ago AND
             position(term in :new_term) = 0 AND
             (user_id = :user_id OR ip_address = :ip_address)

term: new_term,
now: Time.zone.no,
ip_address: request.ip 

If update touches zero rows, then insert a **new** search log row

Or, in English

  • Update existing search log row IF:

    • Same user (for anon use ip address, for logged in use user_id)
    • Previous search started with the text of current search, eg: previous was “dog” and new is “dog in white”
    • Previous search was logged less than 5 seconds ago
  • On click on search result (in either full page search or header) update the clicked_topic_id, (have search results return log id, then update it based on log id + user match + in last 10 minutes)

Limiting log size

So the log does not grow forever there should be a site setting for maximum rows to store. Default should be about a million.

A weekly job can delete oldest rows or something.

Thoughts? Feedback?

12 Likes

So if I search for something and open three results in three tabs, what happens? Last one “wins”?
:arrow_right: The behavior of “multiple clicks for one search, save last only” is better for times when someone searches, clicks, hits back, clicks again, hits back, clicks a final time. New tabs subvert that though.

If I search for a topic, right click to copy URL, and then paste that in as a “you should look here” answer, is that counted as a “click”?
:arrow_right: Copying URLs from posts does not increase click count, for reference. But this is a case when the desired search result really should be saved.

1 Like

Hopefully we can make a little progress on this, maybe this week @eviltrout?

1 Like

I’ve implemented this mostly as specified in the original post:

https://github.com/discourse/discourse/commit/97e211f8372993622698aceb3eeb801a150aa9e5

I also went in and replaced all the mock tests with integration tests and added a separate commit for the clean up job.

8 Likes

Actually I forgot about the click tracking :slight_smile:

Sorry it’s been a difficult couple of days for me, I’ll get it in on Monday.

4 Likes

This commit adds click tracking support:

https://github.com/discourse/discourse/commit/cdb3706025b5c7b233ec7a36f3c0ac29d6970254

9 Likes

May someone please explain how the following columns in the current search_logs table work?

  • search_result_id
  • search_type
  • search_result_type

Also, it would be useful if all columns contained detailed description, and those descriptions were shown in the Data Explorer plugin, here:

image

I’m not sure what you mean by “how they work” but as far as for the “descriptions” those are “column definitions” or “schema information” and are recognizable as such for those experienced with database tables.

To gain an insight on how they “work” you can look at the tables corresponding model. For example:
https://github.com/discourse/discourse/blob/master/app/models/search_log.rb#L95-L106

and near the top of that file you’ll see
https://github.com/discourse/discourse/blob/master/app/models/search_log.rb#L6-L20

From there you can search for variable and function names etc to find other files that interact with the table (a good IDE or text editor helps a lot).

Does that help any?

3 Likes

Yeah, sorry for being unclear. By this I just mean “what they mean”.

A description for every column would be useful in database schema though.

@techAPJ was the last to work on this, so maybe he can comment.

1 Like

I guess I may be peculiar, but for me the definitions in combination with good descriptive field names are what I’m used to reading when I read schema.

It requires a shift from reading English to reading code, and I have doubts that “translating” code to English would be practical if even possible. That said, it is often easy enough to explain a small piece of code to answer a specific question.

I hope the code @Mittineague linked above explains what they mean.

Primary ID of search_result_type entity.

3 Likes

This is done and dusted, closing.

2 Likes