Huge network traffic on NAS Storage

I host all of my uploaded files on NAS storage (GlusterFS).

Recently I found that there is huge, constant network traffic on the NAS, and I traced it down to Discourse requesting optimized images. Is there a job that constantly looks up these images? If so, why, and how can I turn it off?

By the way, the clean up uploads site setting is disabled on my forum.

Possibly the backfill @david added for looking up primary image color.

It will eventually finish and return to a steady state.

We need to walk all the images for the backfill. You may be able to work around it by forcing the color on all images to white or something.

As far as I can see, it is working on 25 images every 15 minutes, yes? That should be negligible. I am seeing thousands of files being looked up every minute.

Also, looking at the bandwidth from 6 months ago, I see the same behaviour, so I think it must be something else.

However, I’m pretty sure it’s being done by a Discourse job or something similar, because when I stop the Discourse app the traffic goes away, whereas when I stop only Discourse’s nginx the traffic remains.

Have a look at /sidekiq. It should tell you which jobs are running. Be sure to click all the tabs.

No job is running. :thinking: Are there other jobs that wouldn’t be listed here?

Or maybe there is something in the container that tries to index files?

All our background logic happens in Sidekiq jobs. If no job is running and you still have high disk I/O, it may be users visiting your website and images being served by nginx?

Do you have a caching CDN fronting static assets?

I tested this previously.

:point_down:

So it’s not because of users visiting the website. If it were, the traffic would have gone away when I stopped nginx.

Then you will need to use Linux inspection tools to see exactly which PIDs are involved and which syscalls are being made.
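
One way to start narrowing this down, sketched here in Python and assuming a Linux host with /proc mounted: list the processes that currently hold a file, directory, or working directory open under the NAS mount point. The function name `pids_using_path` and the path `/mnt/nas/uploads` are hypothetical placeholders. It needs root to see other users' processes, and it only shows who is touching the share, not which syscalls they make; for that, a tracer such as strace is the usual tool.

```python
import os


def pids_using_path(mount_point):
    """Return {pid: [open paths]} for processes with handles under mount_point.

    Linux-only sketch: reads the symlinks in /proc/<pid>/fd and /proc/<pid>/cwd.
    Processes we are not allowed to inspect are silently skipped.
    """
    root = os.path.realpath(mount_point)
    prefix = root.rstrip(os.sep) + os.sep
    found = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        proc_dir = os.path.join("/proc", entry)
        links = [os.path.join(proc_dir, "cwd")]
        try:
            fd_dir = os.path.join(proc_dir, "fd")
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited or permission denied
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if target == root or target.startswith(prefix):
                found.setdefault(int(entry), []).append(target)
    return found


if __name__ == "__main__":
    # "/mnt/nas/uploads" is a placeholder for wherever the share is mounted.
    for pid, paths in sorted(pids_using_path("/mnt/nas/uploads").items()):
        print(pid, len(paths), paths[0])
```

Running it a few times while the traffic is high should show whether the same PID keeps appearing, which is the process worth attaching a tracer to.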

@Falco @sam I think I found the root cause.

First I restarted the Discourse app so that the constant traffic went away. Then I went to the admin panel and opened the bulk reports section. For a long time now, reports have not displayed properly here.

Immediately after the reports time out, I see the jump in network bandwidth, and I see this error in the error logs:


'hijack admin/reports bulk ' is still running after 90 seconds on db default, this process may need to be restarted! 

What is going wrong here?

Is the database on the same NAS storage?

No, the database is on a physical SSD.

Only the upload folder is on the NAS.

So there is no correlation between those. Back to

In fact, I think there may be a correlation. In my test environment, it calculates the used space here.

I think calculating the used space of a NAS folder with a lot of files would be very time-consuming, and that it is the root cause of the high bandwidth.

Am I right? :thinking:

Does running

df -Pk

df -P

du -s

take a significant amount of time on the network share?

These two were instant:

df -Pk

df -P

However, du -s resulted in behavior similar to what I reported above.

It ran for about 5 minutes without finishing, and I had to terminate it manually.
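
The difference between the two kinds of command can be reproduced with a small Python sketch. It assumes that df answers from a single filesystem-level query, while du -s has to stat every file under the directory, which on a network share means at least one round trip per file. The names `filesystem_usage` and `tree_size` are made up for illustration, and the time budget is only there so the walk cannot run forever on a slow share.

```python
import os
import shutil
import time


def filesystem_usage(path):
    """df-style answer: one filesystem query, independent of the file count."""
    return shutil.disk_usage(path).used


def tree_size(path, budget_seconds=10.0):
    """du-style answer: visit every entry, giving up after budget_seconds.

    Sums apparent file sizes (st_size), so the figure will differ somewhat
    from du's block-based number. Returns (bytes, files_visited, finished).
    """
    deadline = time.monotonic() + budget_seconds
    total = files = 0
    stack = [path]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if time.monotonic() > deadline:
                        return total, files, False  # out of time, partial result
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        total += entry.stat(follow_symlinks=False).st_size
                        files += 1
        except OSError:
            continue  # unreadable or vanished directory
    return total, files, True
```

Calling `tree_size` on the mounted share with, say, a 30 second budget and getting `finished` back as `False` would match what du -s showed here: the cost grows with the number of files, not with the size of the disk.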

Oh, I see. That report result is cached, but I guess it never finishes and so can’t be cached, because your network share is too slow.

So is there anything we can do to prevent this? For example, treat it like S3 uploads, where we don’t calculate the disk size?
