Huge increase in Redis use after changing hosts

We recently had to bring our Discourse forum on prem, and overall had a pretty painless experience. We have noticed, though, a pretty massive (10-15x) increase in time spent in Redis per transaction. This is the last three days:

and the same time period, the week before our migration:

We aren’t quite sure what to make of this. Overall the hosts are still pretty snappy at 64ms, but previously we were sitting at around 40ms. This translates, though, into some real slowdowns on the front end, with our page load time increasing by over a second on average.

There are a number of things that are different about these installs, including the version of Discourse (previously we were on 1.8, now we’re on 2.0.3), and I am not sure how alarmed I should be about these numbers. Most of the changes are in the hosting environment; we are using Discourse’s official image. There has been some custom work done to integrate Discourse into our internal auth, but outside of that all of our other plugins are the same.

So, my questions are:

  1. Are these numbers alarming? (I think they are at the very least concerning)
  2. What’s your (Discourse team, any interested party) immediate intuition on where to start investigating/troubleshooting the increase?

One other change to throw into the mix:

Before we migrated, a user had to explicitly create an account in Discourse and had to ‘log in’ to Discourse, even if already logged into their service account. Our auth changes integrate with Discourse’s SSO, so when a user visits the forum an account is automatically created, or they are logged in automatically if they’re already logged into the service. This has created a surge in both new users and logins.
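For reference, the integration follows the standard Discourse SSO handshake: Discourse sends a signed payload, we verify it, and we send back a signed payload identifying the user. A simplified sketch of that flow (Python here just for illustration; the secret and the user fields are placeholders, and our real code passes more attributes):

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode

# Placeholder: the shared secret configured as sso_secret in Discourse.
SSO_SECRET = b"replace-with-your-sso-secret"


def verify_incoming(sso: str, sig: str) -> dict:
    """Check the HMAC-SHA256 signature on the payload Discourse sent us."""
    expected = hmac.new(SSO_SECRET, sso.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        raise ValueError("SSO signature mismatch")
    # The payload is a Base64-encoded query string containing the nonce.
    return parse_qs(base64.b64decode(sso).decode())


def build_response(nonce: str, user: dict) -> dict:
    """Sign a return payload so Discourse logs the user in, creating the account if needed."""
    payload = urlencode({
        "nonce": nonce,
        "external_id": user["id"],   # stable ID from our internal auth
        "email": user["email"],
        "username": user["username"],
    })
    b64 = base64.b64encode(payload.encode()).decode()
    sig = hmac.new(SSO_SECRET, b64.encode(), hashlib.sha256).hexdigest()
    return {"sso": b64, "sig": sig}
```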

  1. Is this Redis inside the default Discourse Docker container, self-hosted Redis elsewhere, or ElastiCache Redis? If it’s ElastiCache, can you share the instance type?

  2. The Prometheus exporter plugin for Discourse can give you Redis time per page type, e.g. on the topic page and on the category page. Can your graphs give you that breakdown? That way we can compare numbers.

Alarming indeed.

Here on Meta (the topics show controller), running on an AWS ElastiCache t2.medium, we get around 20ms at p99 and 10ms median.
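If you want a raw baseline to compare against, a quick round-trip test from the app host helps separate "Redis/network is slower" from "more Redis calls per page". A rough sketch, assuming the redis-py client and with the connection details as placeholders:

```python
import statistics
import time

import redis  # assumes the redis-py package is installed

# Placeholder: point this at the Redis instance your Discourse app uses.
r = redis.Redis(host="your-redis-host", port=6379)

samples = []
for _ in range(1000):
    start = time.perf_counter()
    r.ping()  # one request/response round trip to Redis
    samples.append((time.perf_counter() - start) * 1000)  # milliseconds

samples.sort()
print(f"median: {statistics.median(samples):.2f} ms")
print(f"p99:    {samples[int(len(samples) * 0.99)]:.2f} ms")
```

That only measures network plus command overhead, not what Discourse itself does per request, but it will quickly tell you whether the new host’s path to Redis is simply slower.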


@Falco

It is hosted outside of the container, not in ElastiCache, but it’s closer to an m3.large: 2 CPU cores, 8GB of RAM.

Yeah, we can compare by transaction. The majority of the time is spent in:
MessageBus:

TopicsController#show:

:thinking: A slightly slower message bus should not result in slower page load times. It’s used for background tasks.


I agree; I am not casting a ton of suspicion on the message bus, but the fact that so much time is spent there is demonstrative of the change. This is from the week before we migrated:

Hmmm, not to brag, but our bare metal Redis perf is a bit crazy: 2ms median / 8ms p99.

Hard to get similar numbers in the cloud (right now 15ms median / 22ms p99 for Meta in AWS).


Fair. So to clarify: since most of this increase in Redis time is in the message bus, and TopicsController#show is showing results similar to Meta’s, maybe the increase in Redis time isn’t the source of the front-end slowdown?

Yes. Are you using our mini-profiler?

It’s great to track down some slow queries while browsing the site normally.
