Admin banner notification of error floods at /logs

We use the awesome Logster to display our error logs on all Discourse sites in the /logs path.

In many cases we noticed “log floods” that do not get notified. Log floods are strong indicators that stuff is seriously :pear: shape.

  • Perhaps anonymous has zero access to your site

  • The job queue might be completely broken

  • You may have a plugin that makes certain pages on the site not work

  • And more

As it stands as an administrator these failure modes could be in-progress and you could remain completely oblivious of them happening.

To help take care of this issue we would like to add some extensions to Logster and Discourse to notify us of a ticking :clock1: :bomb:

Logster extensions

  • Add internal tracking for “warnings/errors/fatal in last N seconds”, allow for multiple tracking modes, default is none.

Last N seconds needs to be bucketed somehow, this is how I would implement this which is more or less correct.

Logster.store.track_event_rate([:warning], 1.minute)
Logster.store.track_event_rate([:warning,:error], 1.hour)

num_events = Logster.store.lookup_event_rate([:warning], 1.minute)

# Possibly, concurrency is really hard here (possibly a v2 thing):
# The tricky thing is that you don't want to be going crazy and flooding the message bus
# each time people add 1 to an already exceeded count
# You would only fire this event once every 1.minute max (or something like that) 

Logster.store.on_event_rate_exceeded([:warning], 1.minute) do |rate|
   MessageBus.publish("/error_rate", "OH NO #{rate}")
end

Bucketing counts in Redis

Say we are trying to track number warnings in last minute simplified example:

  • Get my current “seconds since epoch” > value is say 601.
  • Divide this number by “number of tracking buckets” which will be “6”: we get 100
  • INCR “TRACKING_100”
  • If INCR returns 1, be sure to expire the key in 70 seconds
  • At any point we can get our “errors per second” by grabbing the values of “TRACKING_100” + “TRACKING_99” etc.
  • We can use LUA here to make it all an atomic operation, that way you call a routine to increment and it gives you the current rate. (totally happy for this feature to depend on redis having lua going)

Discourse extensions

On Discourse side we need a simple site settings for:

  • alert_admins_if_errors_per_minute
  • alert_admins_if_errors_per_hour

Then treat warnings/errors/fatal all as errors (so track an aggregate)

If the rate is exceeded use this kind of screen to notify admins in the same spot we use for this:

Ideally the earlier we notify the better, but really depends on the logster implementation.

At any point an admin can click X and make it go away for a minute, to come back again if it exceeds the rate again.

6 Likes

Yes, a lovely idea. Thank you! I also think it would be handy to also get an email, or if this has to be done in a discourse way, to get a message similar to the “archive completed”. That way admins would be notified by email if they are offline as well.

4 Likes

Two PRs up for the initial feature :slightly_smiling:

I’ll work on this once the main feature is stable :slightly_smiling:

4 Likes

Following up that a week later and I am still receiving the “3 errors/hour” notice a couple times a day and the logs are still empty. Anyone know if there are other logs somewhere that might also be keeping track of these errors?

Or what mechanism of the tagging plugin might be continuing to try to add error-causing tags without user input?

I believe there were a fair number of bugs here @tgxworld can elaborate.

1 Like

There was a bug where we were tracking the error rate way too early in the code. Errors that are supposed to be ignored by Logster were being tracked as well.

Btw, what did you set your error rate per hour limit to?

3 Likes

OK, got it. I presume the update to beta14 that I just got pushed takes care of this then?

3 errors per hour. Left the per minute disable at 0 though.

2 Likes