Very slow discobot certificate generation on our server

We have been pushing our new members to go through the discobot tutorial and quite a few are hitting a problem we had already experienced in a previous (staging) Discourse instance: the certificate in the last post takes a long time to generate and to actually appear, with no indication that it is being loaded/generated. By long time I mean a 10-20s period — enough for someone to give up on it. I tried on meta and it seems to be instant here.

I noticed that the certificate can be generated on demand using the /discobot/certificate.svg?date=Oct+28+2020&user_id=123 URL, so I did some tests. At first, it was instant. But as soon as I used user ids that didn’t do the tutorial recently, I had to wait the 10-20s period (it’s fast once generated the first time, so I assume there’s cache somewhere). I also got some 500 errors during the waiting period which I assume was a timeout of some sort because it worked on the next refresh.

On the /logs, I’m seeing a warning which I don’t know if it is related:

Failed to process hijacked response correctly : Net::ReadTimeout

traceback
/usr/local/lib/ruby/2.6.0/net/protocol.rb:217:in `rbuf_fill'

/usr/local/lib/ruby/2.6.0/net/protocol.rb:191:in `readuntil'

/usr/local/lib/ruby/2.6.0/net/protocol.rb:201:in `readline'

/usr/local/lib/ruby/2.6.0/net/http/response.rb:40:in `read_status_line'

/usr/local/lib/ruby/2.6.0/net/http/response.rb:29:in `read_new'

/usr/local/lib/ruby/2.6.0/net/http.rb:1509:in `block in transport_request'

/usr/local/lib/ruby/2.6.0/net/http.rb:1506:in `catch'

/usr/local/lib/ruby/2.6.0/net/http.rb:1506:in `transport_request'

/usr/local/lib/ruby/2.6.0/net/http.rb:1479:in `request'

rack-mini-profiler-2.0.2/lib/patches/net_patches.rb:9:in `block in request_with_mini_profiler'

rack-mini-profiler-2.0.2/lib/mini_profiler/profiling_methods.rb:39:in `step'

rack-mini-profiler-2.0.2/lib/patches/net_patches.rb:8:in `request_with_mini_profiler'

/var/www/discourse/lib/final_destination.rb:370:in `block in safe_get'

/var/www/discourse/lib/final_destination.rb:414:in `block in safe_session'

/usr/local/lib/ruby/2.6.0/net/http.rb:920:in `start'

/usr/local/lib/ruby/2.6.0/net/http.rb:605:in `start'

/var/www/discourse/lib/final_destination.rb:411:in `safe_session'

/var/www/discourse/lib/final_destination.rb:362:in `safe_get'

/var/www/discourse/lib/final_destination.rb:131:in `get'

/var/www/discourse/lib/final_destination.rb:152:in `get'

/var/www/discourse/lib/file_helper.rb:55:in `download'

/var/www/discourse/plugins/discourse-narrative-bot/plugin.rb:113:in `fetch_avatar'

/var/www/discourse/plugins/discourse-narrative-bot/plugin.rb:98:in `block in generate'

/var/www/discourse/lib/hijack.rb:56:in `instance_eval'

/var/www/discourse/lib/hijack.rb:56:in `block in hijack'

/var/www/discourse/lib/scheduler/defer.rb:94:in `block in do_work'

rails_multisite-2.3.0/lib/rails_multisite/connection_management.rb:68:in `with_connection'

/var/www/discourse/lib/scheduler/defer.rb:89:in `do_work'

/var/www/discourse/lib/scheduler/defer.rb:79:in `block (2 levels) in start_thread'

I take it that the certificate is generated locally, without any external network requests?

The machine is t3a.medium (4gb of memory) and the forum is not particular busy (~0.4 of load, plenty of free memory still). During the wait for the certificate to generate, the CPU didn’t seem to change at all, so that doesn’t seem to be the bottleneck.

We are not the only ones experiencing this (not sure if this is related), but I don’t see anything there that can help.

Not sure if I’m onto something, but if I refreshed the certificate URL a few times, eventually I get an error telling me to slow down. Is there some URL throttling mechanism that may be interfering here if many people are generating the certificates at that same URL path? Possibly not, but I’m at loss on why this is happening. Any hints welcome.

5 Likes

That route is indeed slow, as it is knew.

Other than trying to make it faster, maybe we can add a copy change to the message telling the user that Discobot is preparing something and that it will take a minute?

4 Likes

I wasn’t aware that it was known to be slow — if nothing else, I guess an update on the post to make that clear may help, but I still expect some people to give up on it if the wait is long.

I must say that 10 to 20 seconds to generate seems unusually long to output a pre-existing svg (I assume?) filled with some text? As far as I can see, this doesn’t seem to be CPU bound. What’s the bottleneck here?

I was looking at editing the copy here to include a “please wait, I’m generating something cool for you!” sentence in the previous step… but on try.discourse.org the certificate generation is VERY fast:

discobot last step timestamp: Nov 2, 6:46 PM 1604371566089
discobot cert gen step timestamp: Nov 2, 6:46 PM 1604371567470

I don’t exactly know what the units are on the underlying timestamps, but that’s a difference of 1381.

I replied as fast as I could on the same topic, and got these timestamps, easily less than a few seconds apart:

1604373488630
1604373492182

That is a difference of 3552 … and I literally posted as fast as I could, via pasting and keyboard shortcuts!

So… does your server just… suck… or something?

I believe he is taking about the generation of the certificate image, not the post.

I am indeed talking about the generation of the certificate — not the reply post that embeds it (which is quick). This is what this issue is causing to some of our members:

image

No, a t3a.medium is a middle range instance and the load is not high.

As I said, this does not seem to be CPU bound because while I wait for a, eg /discobot/certificate.svg?date=Oct+28+2020&user_id=123 request to finish, the CPU is not more busy than any other time (which isn’t very busy).

The issue seems to be related with a network request timing out:

The avatar on the certificate seems to be a data:image/png;base64 blob. From what I can see, there’s an attempt to download an avatar that is timing out? Perhaps that’s what is slowing things here?

For context, this is in a forum is with SSO and avatars are https URLs stored on S3. I have tried fetching some of those avatars from within the docker container and it works and is nearly instant — no firewall issues, etc. Everything else on the forum works fast, except when generating the certificate which takes a long time.

Happy to try to debug this further if I’m given pointers on what to check.

The image appeared more or less instantly for me on try.discourse.org … I didn’t notice a significant delay of more than maybe half a second if even that? give it a shot yourself if you don’t believe me!

So I’m back to what I originally said: why is this so slow on your server, yet so fast on try.discourse.org ?

Why indeed… That’s what I’m trying to get support about.

I understand that this is a “it works in my machine” kind of scenario that shouldn’t even exist with docker, but alas, here we are and it doesn’t seem I’m the only one.

Any hints on how to debug this on my side would be appreciated.

1 Like

I spent some time looking into this and the issue seems to be that the discourse-narrative-bot blocks when fetching the avatar of the user to include on the certificate. Here:

This forum uses SSO and the avatars are Amazon S3 URLs — which are fast and not rate limited by Amazon. That said, that function is fetching an avatar URL of the form:

https://forum.tld/user_avatar/forum.tld/username/240/18705_2.png

Could the timeout errors that are slowing down the certificate generation be from hitting Discourse’s rate limiting somehow? Either by having the avatars being requested too fast from Discourse instance itself, or, from trying to hit Amazon S3 too fast when multiple users are doing the bot tutorial?

3 Likes

Interesting. We inline a base64 image in there. I wonder why we didn’t just link to the external resource in the image svg tag :thinking:

I dived into the history of the plugin all the way til where we first added this:

And looks like we always used base64 inlined avatars.

EDIT: After some research, we inline with base64 because the certificate is loaded via a img HTML tag. To get cheaper avatars we would have to move to a object tag, which is default blocked in our CSP. Or inline the entire SVG in the post content. I guess an iframe is our best shot.

I have FIX: Make discobot certificate gen faster/non blocking by xfalcox · Pull Request #11344 · discourse/discourse · GitHub for @tgxworld comments.

5 Likes

I assumed that the svg embedding the base64’ed avatar was for the same reason it “was doing” the logo: so that the certificate could be a snapshot in time — maybe saved by the awardee, without any external dependencies, so that it would still render later the exact same way.

The iframe approach prevents it from being easily shared and bragged about on a post. I don’t know how common this is on other forums, but I have seen some of our users doing it by just copying the URL of the image.

If going with linking to something external, wouldn’t using <image> work here and avoid the need to change allowed_iframes?

That is what the above PR does if you read it :grimacing:

But if you load the svg with the HTML img tag as we do today, it won’t load external images contained in the SVG image tags. It must be loaded using either object or iframe for that to work.

3 Likes

Clearly I don’t know enough about SVG. The iframe approach prevents the certificate from being quoted/copied, but nonetheless I’m happy that this is being addressed somehow as it will fix our issue and also the missing logo on the certificate. :+1:

2 Likes

@mentalstring I just merged the new certificate generation code. Should be a lot faster now.

4 Likes

Great to see this being merged. :+1:

We are on stable and I don’t have a good way to test this right now. But if fetch_avatar is no longer used, I expect our Net::ReadTimeout errors to go away with this. Marking this as solved and will open again if for some reason I hit the same again when we update. Thanks!