Some inconsistencies in search results

Hey all - I’m trying to wrap my head around how search results are generated and am running into some weird inconsistencies. I’d appreciate it if folks could point me in the right direction!

What I’m trying to do

I’m generating a monthly “this week in our project” report on project activity. It pulls from many sources, like GitHub and Discourse. I’d like to do the following:

  1. Get a list of all new Discourse posts within a specific range of time (e.g. 2019-05-01 -> 2019-05-31)
  2. Get a list of all Discourse topics that were commented on during this range of time
  3. Provide a short list of “most active topics” as well as a list of “most active new topics” with links to our Discourse
  4. (ideally, but can’t figure out how) Provide a list of new and active users on the Discourse site

Where I’m getting confused

I’m generating this report programmatically (w/ python) so have been looking into the Discourse API for this. It seems like the /search endpoint is the only way to request data within a range of dates (as opposed to using posts.json etc). However, the results that this endpoint returns seem to be off.

As an example, here are a few results. In each case I’m searching only by date range, with no keywords:

  • If I search for after:2019-05-01 before:2019-05-03, then it returns 15 results
  • If I search for after:2019-05-03 before:2019-05-05 then it returns 11 results
  • If I search for after:2019-05-01 before:2019-05-05 then it returns 20 results

This confuses me, because I assume that the third search (which includes the full span of dates in the 1st and 2nd search) would return 15+11 = 26 results, instead of 20.

Could somebody explain this behavior to me? Or point me to a resource that goes into the search information more deeply?

Or more generally, I am open to “Chris you’re going about this the wrong kind of way, there’s a better way of getting this information” responses as well :slight_smile:*

*with the exception of “pay for the data exporter” plugin - I’m coming from an open source community w/o a ton of resources, and while I’m working on getting buy-in from folks to pay for a version of Discourse, we’re not there yet :wink:

(thanks for reading this slightly long post!)

Perhaps @sam can advise you

We add the filter like so:

(posts.created_at < '2019-05-03 00:00:00') AND (posts.created_at > '2019-05-01 00:00:00') 

What is happening here is that we always aggregate results:

If a term hits once on topic X on Monday and once on topic X on Wednesday then we only show the topic from Wednesday. If you need to dig through all the matches on topic X you always have search within topic.
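
To make the aggregation concrete, here is a toy sketch in Python. It does not call Discourse at all: the post list and the `aggregated_hits` helper are made up for illustration, and the helper only mimics the "one hit per topic" behaviour described above together with the strict `>` / `<` date filter.

```python
from datetime import datetime

# Hypothetical (topic_id, created_at) pairs standing in for matching posts.
posts = [
    (1, datetime(2019, 5, 1, 12, 0)),
    (1, datetime(2019, 5, 4, 12, 0)),
    (2, datetime(2019, 5, 2, 12, 0)),
    (3, datetime(2019, 5, 4, 12, 0)),
]

def aggregated_hits(posts, after, before):
    """Return one hit per topic: the newest post strictly inside the range."""
    newest = {}
    for topic_id, created_at in posts:
        if after < created_at < before:
            if topic_id not in newest or created_at > newest[topic_id]:
                newest[topic_id] = created_at
    return newest

first = aggregated_hits(posts, datetime(2019, 5, 1), datetime(2019, 5, 3))
second = aggregated_hits(posts, datetime(2019, 5, 3), datetime(2019, 5, 5))
combined = aggregated_hits(posts, datetime(2019, 5, 1), datetime(2019, 5, 5))

print(len(first), len(second), len(combined))  # prints: 2 2 3
```

Topic 1 has a post in each half of the range, so it is counted once in each of the two short searches but only once in the combined search. That is why the two short ranges can add up to more than the long one.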

Ahh that makes sense - so is there some way to inspect the results to tell how many posts have been made in that topic within the time frame given? Basically I just want to say “these are the most active topics in this time frame”. If I just went off of “most posts” then I’d just get the super popular topics that have been around forever with lots of old posts.

I think you are probably going to have to lean on posts.json here. It has a bunch of useful params that make it fairly easy to simply fetch the deltas since your last update. Then you can do whatever fancy processing you want locally.
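
As a rough sketch of the "do it locally" part, the snippet below counts posts per topic inside a time window once the posts have been downloaded. The sample data is invented, and the `topic_id` and `created_at` field names (and the timestamp format) are assumptions about what each post record looks like, so check them against a real response before relying on this.

```python
from collections import Counter
from datetime import datetime

def parse_ts(value):
    """Parse a UTC timestamp such as '2019-05-02T09:00:00.000Z' (assumed format)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")

def most_active_topics(posts, start, end, top_n=5):
    """Count posts per topic with start <= created_at < end, busiest first."""
    counts = Counter(
        post["topic_id"]
        for post in posts
        if start <= parse_ts(post["created_at"]) < end
    )
    return counts.most_common(top_n)

# Invented sample records; real ones would come from the downloaded posts.
sample_posts = [
    {"topic_id": 10, "created_at": "2019-05-02T09:00:00.000Z"},
    {"topic_id": 10, "created_at": "2019-05-20T09:00:00.000Z"},
    {"topic_id": 11, "created_at": "2019-05-03T09:00:00.000Z"},
    {"topic_id": 10, "created_at": "2019-04-30T09:00:00.000Z"},
    {"topic_id": 12, "created_at": "2019-06-01T09:00:00.000Z"},
]

ranking = most_active_topics(sample_posts, datetime(2019, 5, 1), datetime(2019, 6, 1))
print(ranking)  # prints: [(10, 2), (11, 1)]
```

Because only posts inside the window are counted, an old topic with lots of historical posts does not dominate the ranking. Counting a user field instead of `topic_id` (if the records carry one) would give a similar "most active users" list.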

https://github.com/discourse/discourse/blob/1cf0b549abc35f56b9fbb80e8aa05677d32b6e5a/app/controllers/topics_controller.rb#L202-L203

Thanks for the information! Hmmm, I’m a bit confused about the parameters you mention - in the documentation for the posts api:

https://docs.discourse.org/#tag/Posts%2Fpaths%2F~1posts.json%2Fget

It seems like there’s only a single parameter, “before”, and it’s unclear to me where that parameter is supposed to go. Is there other documentation that explains more of this endpoint’s functionality?

If I can get “all the recent posts just before an arbitrary date” that would work just fine as well I think!
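
For what it's worth, here is a sketch of how "walk backwards until an arbitrary date" could look, assuming that "before" is a post id used for paging and that each page comes back newest first. Both are my assumptions from reading the docs, not confirmed behaviour. `fetch_page` is a hypothetical callable; the fake one below just serves invented data so the loop can be run without a network connection.

```python
from datetime import datetime

def parse_ts(value):
    """Parse a UTC timestamp such as '2019-05-02T09:00:00.000Z' (assumed format)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")

def collect_posts_since(fetch_page, cutoff):
    """Page backwards through posts until one is older than `cutoff`.

    `fetch_page(before_id)` is a hypothetical callable that returns a list of
    post dicts, newest first, with ids lower than `before_id` (None = newest).
    """
    collected = []
    before_id = None
    while True:
        page = fetch_page(before_id)
        if not page:
            break
        for post in page:
            if parse_ts(post["created_at"]) < cutoff:
                return collected
            collected.append(post)
        before_id = min(post["id"] for post in page)
    return collected

# Invented data and a fake fetcher standing in for real HTTP requests.
FAKE_POSTS = [
    {"id": 5, "created_at": "2019-05-03T00:00:00.000Z"},
    {"id": 4, "created_at": "2019-05-02T00:00:00.000Z"},
    {"id": 3, "created_at": "2019-05-01T12:00:00.000Z"},
    {"id": 2, "created_at": "2019-04-30T00:00:00.000Z"},
    {"id": 1, "created_at": "2019-04-29T00:00:00.000Z"},
]

def fake_fetch_page(before_id, page_size=2):
    eligible = [p for p in FAKE_POSTS if before_id is None or p["id"] < before_id]
    return eligible[:page_size]

recent = collect_posts_since(fake_fetch_page, datetime(2019, 5, 1))
print([p["id"] for p in recent])  # prints: [5, 4, 3]
```

A real `fetch_page` would make the HTTP request and pull the list of posts out of the JSON response. Keeping the fetcher separate from the paging loop means the loop can be tested on fake data first.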