Discourse PURGE cache method on content changes

How can I have Discourse send a PURGE request to my cache server when its content is updated?

Our server has a ton of RAM, so I run all my sites behind a Varnish cache, which means that >80% of the queries that come to our sites never actually have to trigger any backend calls (no executing of server-side code, no need to make SQL queries, etc). So >80% of our requests are super-fast and just returned from RAM.

I’d like to do this for our Discourse site as well. Most queries of a user reading around in our threads are read-only, and there’s no reason to bother Discourse at all for this activity; these requests should just be returned to the client directly from RAM by varnish without querying the Discourse backend at all.

Obviously the cache needs to be notified when the content for a given page changes so the next request for that page will actually call the backend. Subsequent requests for that page are then returned from cache until the cache object expires (for us, after 24 hours) or it’s PURGEd by a call from the application (e.g. WordPress, MediaWiki, Discourse, etc).
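To make the expire-or-PURGE behaviour I'm describing concrete, here is a minimal Python sketch of it. This is not Varnish or Discourse code; the `TtlCache` class and its method names are made up purely for illustration.

```python
import time


class TtlCache:
    """Tiny in-memory cache with a TTL and an explicit purge (illustration only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so the behaviour can be tested
        self.store = {}      # url -> (expires_at, body)

    def get(self, url, fetch_from_backend):
        entry = self.store.get(url)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1], "HIT"
        # Miss or expired object: this is the only path that calls the backend.
        body = fetch_from_backend(url)
        self.store[url] = (now + self.ttl, body)
        return body, "MISS"

    def purge(self, url):
        # What a PURGE request asks the cache to do: forget this object.
        self.store.pop(url, None)
```

With a 24-hour TTL and no purge, a page could be served stale for up to 24 hours; with a purge on every change, the next request after the change goes to the backend.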

This is all pretty standard for web applications. For example, making PURGE calls when an article changes is built directly into MediaWiki:

And there are lots of plugins for doing this in WordPress. Personally, we use Razvan Stanga’s ‘vcaching’ plugin:

So how can this be achieved with Discourse? How can Discourse be configured so that it sends PURGE requests to our defined cache servers when a change takes place to one of its pages (e.g. a new post in a thread, an edit to an existing post, a user’s bio changes, etc)?
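For reference, this is roughly the call I would want the application to make on each change, sketched in Python with only the standard library. The function names and the list of topic paths are my own assumptions (the exact URLs Discourse serves for a topic would need to be confirmed), and Varnish only honours PURGE if the VCL is configured to accept it from the calling host.

```python
import http.client


def purge_paths_for_topic(topic_id):
    """Paths to invalidate when a topic changes (assumed URL scheme, verify it)."""
    return [
        f"/t/{topic_id}",
        f"/t/{topic_id}.json",
        f"/t/{topic_id}/posts.json",
    ]


def send_purge(cache_host, cache_port, path, site_host, timeout=2.0):
    """Send one PURGE request to the cache server and return the HTTP status."""
    conn = http.client.HTTPConnection(cache_host, cache_port, timeout=timeout)
    try:
        # http.client passes the method string through unchanged, so a
        # non-standard verb like PURGE works. The Host header tells the
        # cache which site's object to drop.
        conn.request("PURGE", path, headers={"Host": site_host})
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        conn.close()
```

A hook that fires on a new post would then loop over `purge_paths_for_topic(topic_id)` and call `send_purge` for each path against every cache server.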

Pinging @Lee_Ars – how did you get varnish cache invalidation working with Discourse?

The stack is HAProxy -> Varnish (cache) -> nginx reverse proxy -> Discourse

I don’t think this will work with a JavaScript application like Discourse? That kind of caching is for old school 1999 websites with static HTML and CSS.

Sorry, but I don’t follow this logic at all…

The function of Discourse is to produce HTML, CSS, and JS to be viewed by a web browser. It’s all still HTTP requests & responses that can be cached by varnish, even if they’re initiated by JavaScript and broken up into smaller requests.

For example, when viewing this Discourse topic, my browser made an HTTP request to meta.discourse.org/.../posts.json that returned an HTTP response with a JSON payload including a posts_stream and an array of posts with their details (post id, username, avatar url, timestamps, etc). All of this data could be cached by varnish and returned to the client’s browser when a user views a topic page, and Discourse will never even have to be called.

Let’s start with the basics: can you point me in the correct direction for how I’d write a Discourse plugin such that Discourse would call a function I defined whenever a topic has a new post, ideally passing the topic id to my function?

If you look at that json payload, you’ll see that there is data in it that’s changed every time that it’s requested.

There is a bit of stuff that can be cached, but not as much as you think. Search here for cloudflare to get some ideas of the problems and some of the solutions.

Basically you can configure it as a CDN.

Discourse is a JavaScript app. All the HTML is generated on the fly by the JavaScript app running in the browser.

And “caching” the JSON data responses would break the app because you would be looking at stale data. That’s like arguing you should “cache” the results of SQL queries from a database!

You would still miss out on a lot of features of Discourse if you purge your cache only when there’s a new post. So much can happen: edits, likes, flags to name just a few of the actions which affect how posts are shown.

After the initial load of Discourse the system loads only JSON files from the server which are really small and fast to load. Moreover, Discourse has a fantastic cache for anonymous users which works out of the box and reduces the server load a lot.

My recommendation: Don’t try to put another cache in front of Discourse. You will most likely break it.

Incorrect.

Discourse produces tailored views per-user. My view of /latest reflects my personal footprints across the site. My read progress per-topic is retained and reflected when I revisit topics. Categories I mute won’t appear.

If you want a discussion platform which works with old archaic bulk-caching methods, Discourse isn’t it.

This is not Discourse-specific. Most web applications display content tailored to logged-in users.

To be clear, I have no intention of caching content for users that are logged-in. Logged-in users’ requests typically go straight for the backend in web apps. It’s standard practice.
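As a rough illustration of that rule, here is a Python sketch of the decision a cache layer would make per request. The function name is hypothetical, and the cookie name `_t` is my assumption for the login cookie; it should be verified against the actual install before relying on it.

```python
from http.cookies import SimpleCookie

# Assumption: name of the cookie that marks a logged-in session.
AUTH_COOKIE = "_t"


def should_serve_from_cache(method, cookie_header):
    """Return True only for read-only requests that carry no login cookie."""
    if method not in ("GET", "HEAD"):
        return False  # writes always go to the backend
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    return AUTH_COOKIE not in cookies
```

So `should_serve_from_cache("GET", "")` is true for an anonymous reader, while any request carrying the login cookie, or any POST, bypasses the cache and goes straight to the backend.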

Consider a reddit hug-of-death: suddenly we have a huge spike of traffic accessing a single topic page, perhaps with smaller spikes to a few other related topics. Almost 100% of this traffic is read-only queries requesting a single page (or broken up into a few JS queries; it makes no difference to varnish) from users that are not logged-in.

In this case, our site could crash and then we could elastically spin-up more stateless web nodes ($$$). Or this small set of content could just be cached in RAM and returned directly to the users without bothering the backend (Discourse) at all – except for maybe one request every minute or so if/when a logged-in user adds a new post to the topic, a like is added, etc.

The benefits of caching before the backend are immense.

You need to spend more time looking at how discourse actually works (JavaScript app payload being delivered to the browser, network traffic during use) before posting your next wall of text.

Your responses evidence that you don’t understand how Discourse works on a fairly fundamental level.

Or go off and prove us wrong?

Yes, as I said, please just point me in the right direction for the basics: how would I write a Discourse plugin such that Discourse would call a function I defined whenever a topic has a new post, ideally passing the topic id to my function?

Of course it would be expanded for other updates requiring cache invalidation, but I would greatly appreciate if you could point me in the right direction for new posts to a topic as a starting point.

Great! Can you please link me to the documentation that describes this built-in cache?

You can look at the plugin development #howto topics. There are several plugins that hook into new topics. You can have a look at the plugins in the discourse repository and find them. The GitHub plugin acts on new posts, so you might look at it.

It is still the case that what you are trying to do is unnecessary, if you are to believe the people who developed the software and/or have spent years working with it, but perhaps you know better. You can read the source. It’s all there.

There is stuff in place that caches stuff for anonymous users. There is stuff in place to respond if the system is under heavy load.

It would be better for you to generate the heavy load and demonstrate the actual problem versus this imagined, conceptual idea of a potential problem … because we have some very large customers now, with quite a bit of traffic.

We do a lot to cache anon out of the box; we spend about 1-3ms on requests to anon cached payloads.

I guess with varnish you could reduce this 1-3ms down to fractions of a millisecond, but there is a large amount of logic you would need to replicate.

We will not be spending any time on this given we can comfortably host sites that see upwards of 100 thousand posts a week that feature very heavily on reddit.

Thank you! A link for the lazy:

Thanks Sam,

I still couldn’t find any documentation on this, so would you mind clarifying a few things?

What does “anonymous” mean to Discourse? Is it simply a user that is not logged-in?

Just grepping through /var/www/discourse on the Discourse docker container makes it look like ANON_CACHE_DURATION defaults to 60. I’m assuming that’s 60 seconds. Is that correct? Do you have any tips on if/when this value should be adjusted and how to tune it appropriately?

root@osestaging1-app:/var/www/discourse# grep -ir 'ANON_CACHE_DURATION' *
lib/middleware/anonymous_cache.rb:      env["ANON_CACHE_DURATION"] = duration
lib/middleware/anonymous_cache.rb:        @env["ANON_CACHE_DURATION"]
spec/components/middleware/anonymous_cache_spec.rb:      expect(new_helper("ANON_CACHE_DURATION" => 10, "REQUEST_METHOD" => "POST").cacheable?).to eq(false)
spec/components/middleware/anonymous_cache_spec.rb:      new_helper("ANON_CACHE_DURATION" => 10)
spec/components/middleware/anonymous_cache_spec.rb:      new_helper("ANON_CACHE_DURATION" => 10, "HTTP_USER_AGENT" => "AdsBot-Google (+http://www.google.com/adsbot.html)")
spec/components/middleware/anonymous_cache_spec.rb:      helper = new_helper("ANON_CACHE_DURATION" => 10)
spec/components/middleware/anonymous_cache_spec.rb:      helper = new_helper("ANON_CACHE_DURATION" => 10)
spec/components/middleware/anonymous_cache_spec.rb:      helper = new_helper("ANON_CACHE_DURATION" => 10, "HTTP_ACCEPT_ENCODING" => "gz, br")
spec/components/middleware/anonymous_cache_spec.rb:      helper = new_helper("ANON_CACHE_DURATION" => 10)
spec/components/middleware/request_tracker_spec.rb:      tracker.call(env("REQUEST_URI" => uri, "ANON_CACHE_DURATION" => 60, "action_dispatch.request.parameters" => request_params))
spec/components/middleware/request_tracker_spec.rb:      tracker.call(env("REQUEST_URI" => uri, "ANON_CACHE_DURATION" => 60, "action_dispatch.request.parameters" => request_params))
spec/components/middleware/request_tracker_spec.rb:      tracker.call(env("REQUEST_URI" => uri, "ANON_CACHE_DURATION" => 60))
root@osestaging1-app:/var/www/discourse#

As Jeff said:

Are you 1000% sure you really need this? It seems to me that you’re in for a world of pain and perpetual+unsupported maintenance here. You won’t be able to do a single update without merging your own changes, testing them and figuring out why things broke. All this time is not spent focusing on nurturing your community.

Oh, I’m <10% sure I really need this.

I’m in the very early proof-of-concept stages with Discourse installed in a staging environment. Currently I’m just trying to get it configured so it’s ready for production, and I’m poking & testing it. One of my tasks is to put it behind varnish, so the purpose of this topic is to research what that effort would entail.

I’m very happy to see that there are active responses on these forums about such questions. In the absence of documentation, this topic will not only help me; it will also help future sysadmins who want to install Discourse behind a cache as well. I may not decide that the level of effort is justified, but they may. All of this information is helpful to make such decisions, and I greatly appreciate your responses to my inquiries.

If I find that the built-in Discourse cache is sufficient, I may run Discourse in production without a caching layer like varnish before it.

But, again, there doesn’t appear to be any documentation on the built-in cache, so I have a few questions about it:

Does Discourse invalidate its anonymous cache when there are updates to the backend? Or does ANON_CACHE_DURATION define the maximum amount of time for which stale content will be served to clients?

What would be the consequences of setting ANON_CACHE_DURATION to 24 hours?

And what about visibility into the performance of the built-in Discourse caching system? Varnish has excellent out-of-the-box graphing with munin. For example, the following graph shows that for the past month, >80% of queries across all our websites were cache hits, meaning they were returned to the client’s browser directly from our server’s RAM without ever touching the backend web server. For ~80% of the queries to our server, no server-side code like PHP needs to be executed and no DB queries need to be made. Obviously this hugely decreases the load on our server.

[Image: varnish_hit_rate-month (munin graph of the Varnish hit rate over the past month)]

How can I see how many queries going to a Discourse install are cache hits? Is there a way to graph this data like the munin graph above for varnish?
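For clarity, this is the number I mean by "hit rate", as a small Python sketch. The function names are mine; the counter keys follow the `MAIN.cache_hit` and `MAIN.cache_miss` names that varnishstat reports, and I'm assuming the counters are cumulative so that a per-interval rate comes from the difference between two samples.

```python
def hit_rate(hits, misses):
    """Fraction of cache lookups that were served from cache."""
    total = hits + misses
    if total == 0:
        return 0.0  # no traffic in the interval
    return hits / total


def interval_hit_rate(previous, current):
    """Hit rate between two samples of cumulative counters."""
    hits = current["MAIN.cache_hit"] - previous["MAIN.cache_hit"]
    misses = current["MAIN.cache_miss"] - previous["MAIN.cache_miss"]
    return hit_rate(hits, misses)
```

For example, going from 1000 hits and 400 misses to 1800 hits and 600 misses gives 800 hits out of 1000 lookups in that interval, i.e. a hit rate of 0.8. I'm looking for the equivalent pair of counters for the Discourse anon cache.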

These all feel like unnecessary questions if you just want to deploy the system, and questions better served by actually running an instance if you want to understand the system.

This is heavily complicated by the fact that Discourse itself has multiple layers of caching: nginx will directly serve uploads (sometimes), the Rails app passes off instructions to nginx to serve brotli versions of static content, the anon cache mentioned above, plus the Rails cache at the bottom which can result in “partially cached” requests.

“How many queries going to Discourse are cache hits” is a question where both the numerator and denominator are extremely hard to calculate and of questionable value.
