We use the awesome Logster to display our error logs on all Discourse sites in the /logs
path.
In many cases we noticed “log floods” that do not get notified. Log floods are strong indicators that stuff is seriously shape.
-
Perhaps anonymous has zero access to your site
-
The job queue might be completely broken
-
You may have a plugin that makes certain pages on the site not work
-
And more
As it stands as an administrator these failure modes could be in-progress and you could remain completely oblivious of them happening.
To help take care of this issue we would like to add some extensions to Logster and Discourse to notify us of a ticking
Logster extensions
- Add internal tracking for “warnings/errors/fatal in last N seconds”, allow for multiple tracking modes, default is none.
Last N seconds needs to be bucketed somehow, this is how I would implement this which is more or less correct.
Logster.store.track_event_rate([:warning], 1.minute)
Logster.store.track_event_rate([:warning,:error], 1.hour)
num_events = Logster.store.lookup_event_rate([:warning], 1.minute)
# Possibly, concurrency is really hard here (possibly a v2 thing):
# The tricky thing is that you don't want to be going crazy and flooding the message bus
# each time people add 1 to an already exceeded count
# You would only fire this event once every 1.minute max (or something like that)
Logster.store.on_event_rate_exceeded([:warning], 1.minute) do |rate|
MessageBus.publish("/error_rate", "OH NO #{rate}")
end
Bucketing counts in Redis
Say we are trying to track number warnings in last minute simplified example:
- Get my current “seconds since epoch” > value is say 601.
- Divide this number by “number of tracking buckets” which will be “6”: we get 100
- INCR “TRACKING_100”
- If INCR returns 1, be sure to expire the key in 70 seconds
- At any point we can get our “errors per second” by grabbing the values of “TRACKING_100” + “TRACKING_99” etc.
- We can use LUA here to make it all an atomic operation, that way you call a routine to increment and it gives you the current rate. (totally happy for this feature to depend on redis having lua going)
Discourse extensions
On Discourse side we need a simple site settings for:
alert_admins_if_errors_per_minute
alert_admins_if_errors_per_hour
Then treat warnings/errors/fatal all as errors (so track an aggregate)
If the rate is exceeded use this kind of screen to notify admins in the same spot we use for this:
Ideally the earlier we notify the better, but really depends on the logster implementation.
At any point an admin can click X and make it go away for a minute, to come back again if it exceeds the rate again.