Admin banner notification of error floods at /logs

sam · February 11, 2016, 1:18am

We use the awesome Logster to display our error logs on all Discourse sites in the /logs path.

In many cases we noticed “log floods” that do not get notified. Log floods are strong indicators that stuff is seriously shape.

Perhaps anonymous has zero access to your site
The job queue might be completely broken
You may have a plugin that makes certain pages on the site not work
And more

As it stands as an administrator these failure modes could be in-progress and you could remain completely oblivious of them happening.

To help take care of this issue we would like to add some extensions to Logster and Discourse to notify us of a ticking

Logster extensions

Add internal tracking for “warnings/errors/fatal in last N seconds”, allow for multiple tracking modes, default is none.

Last N seconds needs to be bucketed somehow, this is how I would implement this which is more or less correct.

Logster.store.track_event_rate([:warning], 1.minute)
Logster.store.track_event_rate([:warning,:error], 1.hour)

num_events = Logster.store.lookup_event_rate([:warning], 1.minute)

# Possibly, concurrency is really hard here (possibly a v2 thing):
# The tricky thing is that you don't want to be going crazy and flooding the message bus
# each time people add 1 to an already exceeded count
# You would only fire this event once every 1.minute max (or something like that) 

Logster.store.on_event_rate_exceeded([:warning], 1.minute) do |rate|
   MessageBus.publish("/error_rate", "OH NO #{rate}")
end

Bucketing counts in Redis

Say we are trying to track number warnings in last minute simplified example:

Get my current “seconds since epoch” > value is say 601.
Divide this number by “number of tracking buckets” which will be “6”: we get 100
INCR “TRACKING_100”
If INCR returns 1, be sure to expire the key in 70 seconds
At any point we can get our “errors per second” by grabbing the values of “TRACKING_100” + “TRACKING_99” etc.
We can use LUA here to make it all an atomic operation, that way you call a routine to increment and it gives you the current rate. (totally happy for this feature to depend on redis having lua going)

Discourse extensions

On Discourse side we need a simple site settings for:

alert_admins_if_errors_per_minute
alert_admins_if_errors_per_hour

Then treat warnings/errors/fatal all as errors (so track an aggregate)

If the rate is exceeded use this kind of screen to notify admins in the same spot we use for this:

Ideally the earlier we notify the better, but really depends on the logster implementation.

At any point an admin can click X and make it go away for a minute, to come back again if it exceeds the rate again.

tobiaseigen · February 11, 2016, 5:03am

Yes, a lovely idea. Thank you! I also think it would be handy to also get an email, or if this has to be done in a discourse way, to get a message similar to the “archive completed”. That way admins would be notified by email if they are offline as well.

tgxworld · February 18, 2016, 8:56am

Two PRs up for the initial feature

I’ll work on this once the main feature is stable

oburt · March 30, 2016, 2:27am

Following up that a week later and I am still receiving the “3 errors/hour” notice a couple times a day and the logs are still empty. Anyone know if there are other logs somewhere that might also be keeping track of these errors?

Or what mechanism of the tagging plugin might be continuing to try to add error-causing tags without user input?

codinghorror · March 30, 2016, 2:30am

I believe there were a fair number of bugs here @tgxworld can elaborate.

tgxworld · March 30, 2016, 6:49am

There was a bug where we were tracking the error rate way too early in the code. Errors that are supposed to be ignored by Logster were being tracked as well.

Btw, what did you set your error rate per hour limit to?

oburt · March 30, 2016, 2:29pm

OK, got it. I presume the update to beta14 that I just got pushed takes care of this then?

3 errors per hour. Left the per minute disable at 0 though.

Topic		Replies	Views
Discourse Logging Improvements Feature rfc , spec	5	3095	January 28, 2016
How to alert admins if there is an error in a custom plugin? Dev	5	995	September 20, 2017
Notifications are coming again even after reading them [Private Topics plugin] Support notifications	20	1260	December 1, 2023
Errors in /logs since the latest update Support	4	106	September 1, 2024
Lots of errors in Logs Support	7	831	May 23, 2019

Admin banner notification of error floods at /logs

Logster extensions

Discourse extensions

Related topics