Downloaded Backups are invalid (incomplete download)


(Dean Taylor) #1

When downloading large backups via the admin interface, the download is incomplete / invalid.

#Steps to Reproduce

  1. Go to the admin section of a site containing a large backup
  2. Select the backup button
  3. Select “Yes”
  4. Wait for the backup to complete
  5. Check that the backup created a file in excess of 2GB
  6. Select the “Download” button next to the 2GB download
  7. Wait for the download to complete
  8. Transfer the backup file via SSH from the /var/discourse/shared/standalone/backups/default/ folder to the local machine (see the example command after this list)
  9. Wait for the SSH transfer to complete
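
For reference, step 8 can be done with a plain scp copy along these lines (the hostname and backup filename here are placeholders, not taken from the original report):

# Host and filename are placeholders; the path is the backup folder from step 8.
scp root@forum.example.com:/var/discourse/shared/standalone/backups/default/<backup-file>.tar.gz .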

#Expected
File downloaded via SSH to match the size of the file downloaded via the admin web interface.

#Actual
File downloaded via SSH is 2.04 GB (2,191,337,195 bytes)
File downloaded via web interface is 1.01 GB (1,094,332,416 bytes) and incomplete.
Each attempt at downloading the same file via the web interface downloads a different size file.

#Extra info
Backups of ~245MB download correctly (these are my DB-only backups).

Here are my attempts at downloading some backups:

I also tried downloading with the Fiddler proxy to see if I could spot anything; the headers appear valid.


(Sam Saffron) #2

This is likely an NGINX config issue we need to sort out.

Google points at this: Optimizing Nginx for serving files bigger than 1GB | Nginx Tips

Reading.


(Sam Saffron) #3

No repro with a 1.2GB file on a clean Docker image.

I suspect this is being truncated by the host with some sort of filtering, or something is causing NGINX to hang up.

We should make these downloads resumable.
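
For context, a resumable download relies on HTTP Range requests: the client asks for the remaining bytes and the server answers with a 206 Partial Content response. On the client side that looks roughly like this (the URL is a placeholder, and it only helps once the server honours Range requests):

# Resume an interrupted download from where the partial file on disk left off.
# -C - tells curl to work out the byte offset automatically.
curl -C - -O https://forum.example.com/admin/backups/<backup-file>.tar.gz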


(lid) #4

Will there be any error log from NGINX if, let's say, the NGINX worker got killed?


(Sam Saffron) #5

It's sendfile causing this. I have a repro of the issue; we are blowing RAM and running out of memory.


(Sam Saffron) #6

OK I spent way too much time on this today and need to cut it for now.

You can try adding sendfile off; to the location block in /etc/nginx/conf.d/discourse.conf, which may help. I am uncertain because I can still see the kernel take a huge hit when you serve a big file.
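
For anyone trying this, a minimal sketch of the change, assuming the downloads are served from the main location block in /etc/nginx/conf.d/discourse.conf (the exact block in your config may differ):

location / {
    sendfile off;   # stream via nginx's own buffers instead of the kernel sendfile() path
    # ... leave the existing proxy_pass / try_files directives as they are
}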

To add insult to injury, if I bounce NGINX halfway through downloading a huge file, the browser just pretends that the file downloaded fine, even though the actual payload does not match the Content-Length.

I am a bit stumped here, perhaps @supermathie has some ideas.


(Jens Maier) #7

Why on earth would sendfile consume “a lot” of memory? It's zero-copy; the only memory that should be consumed at all is the buffers in the TCP send queue, and those should be rather small. Or did you tune the net.ipv4.tcp_wmem setting?
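
For reference, the send-buffer limits he is asking about can be inspected like this (the values vary by kernel and any local tuning):

# Prints the min/default/max TCP send buffer sizes in bytes.
sysctl net.ipv4.tcp_wmem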


(Dean Taylor) #8

Perhaps sendfile isn't actually being used, since Content-Length isn't being sent; I thought that was automatic with sendfile.


(Sam Saffron) #9

I am pretty sure I saw a 1GB kernel memory jump for a 1.5GB download; will reattempt on a clean DigitalOcean instance.


(Jens Maier) #10

Hmmm, on second thought, you’re probably (partially) right. I’d expect the kernel to internally mmap the given fd, which means that the file content will get cached in memory for a while. However, these pages will quickly get dropped if another process requests memory.

But now I wonder: would mmap cache pages get listed as buff or cache in vmstat’s output? :wink:
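
One rough way to check where that memory shows up is to watch the counters while a download is in flight; file data read via sendfile or mmap is normally accounted under “Cached”, while “Buffers” covers block-device metadata:

# Watch the kernel memory counters refresh every second during the download.
watch -n1 "grep -E '^(MemFree|Buffers|Cached):' /proc/meminfo"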


(Michael Brown) #11

What browser are you using to perform the downloads? Have you tried different browsers? Different OSes?


(Dean Taylor) #12

Using the following browsers on Windows 7:

  • Google Chrome 37.0.2062.120 m
  • Mozilla Firefox 32.0.2

Firefox was just tested and only downloaded 1.01 GB (1,091,186,688 bytes)


(Dean Taylor) #13

@supermathie Just attempted the download using CURL on the host machine:

Each time the download was a different size.

CURL version details:

root@forum:/tmp# curl -V
curl 7.35.0 (x86_64-pc-linux-gnu) libcurl/7.35.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smtp smtps telnet tftp
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP

(Dean Taylor) #14

@sam Thank you for implementing the progress bar / content-length feature request; I can report that the progress is now displayed.

Sadly this does not help with the incomplete downloads issue.

Here is a screen grab of about as far as it gets - the download then ends prematurely.


(Sam Saffron) #15

Curious… if you pause/resume, does it work through the hump?


(Dean Taylor) #16

Nope, pausing for ~20 seconds at the ~900MB mark had no impact.

Attempted the download via curl again; noted additional error info (“bytes remaining to read”) and indications of 64%, 66%, and 53% complete:

<"%; __profilin=p"%"3Dt" -H "Connection: keep-alive" -H "Cache-Control: no-cache" --compressed
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 64 2141M   64 1371M    0     0  45.1M      0  0:00:47  0:00:30  0:00:17 48.3M
curl: (18) transfer closed with 806967306 bytes remaining to read
<; __profilin=p"%"3Dt" -H "Connection: keep-alive" -H "Cache-Control: no-cache" --compressed
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 66 2141M   66 1419M    0     0  44.9M      0  0:00:47  0:00:31  0:00:16 46.5M
curl: (18) transfer closed with 757057546 bytes remaining to read
<; __profilin=p"%"3Dt" -H "Connection: keep-alive" -H "Cache-Control: no-cache" --compressed
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 53 2141M   53 1136M    0     0  36.0M      0  0:00:59  0:00:31  0:00:28 39.6M
curl: (18) transfer closed with 1053358090 bytes remaining to read

(Michael Brown) #17

This relates to nginx serving proxied requests that are large, whereas for what we’re doing nginx should just be able to use sendfile to fire the existing file off to the client.


(lid) #18

I can confirm that on a local instance: the download randomly breaks at around ~1GB.

For testing, I generated a 2GB file in the backup folder.

Then I raised the NGINX error log level to debug and got “upstream prematurely closed connection while reading upstream”.

When NGINX realized the worker/request died, it killed the download.
What I am not sure about is: if the download is delegated to NGINX, why does it still keep ties to the unicorn worker?

Just for testing, I changed the timeout in /var/www/discourse/config/unicorn.conf.rb from 60 to 3000, and I was able to successfully finish the 2GB file download. :white_check_mark:
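
For anyone reproducing this, the change amounts to a single line in that file; a sketch, assuming the stock config ships with a 60-second worker timeout (the exact default may differ between Discourse versions):

# config/unicorn.conf.rb - unicorn kills workers that take longer than this many seconds
timeout 3000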


(Sam Saffron) #19

Something is super fishy here. I can confirm that terminating unicorn once the transfer starts kills the transfer, a very strong indicator that X-Accel-Redirect is not working right.
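
For background, X-Accel-Redirect is supposed to hand the transfer off to NGINX entirely: the Rails app answers with the header instead of the file body, NGINX matches it against an internal location and streams the file from disk itself, and the unicorn worker is out of the picture from then on. A hypothetical sketch (the location name and paths are illustrative, not the exact Discourse config):

# Internal-only location, reachable solely via an X-Accel-Redirect header from the app.
location /downloads/ {
    internal;
    alias /var/discourse/shared/standalone/;
}
# The app then responds with e.g.:
#   X-Accel-Redirect: /downloads/backups/default/<backup-file>.tar.gz
# and NGINX serves the file without tying up the unicorn worker.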


(Sam Saffron) #20

This should be fixed now

https://github.com/discourse/discourse/commit/dc8eb6d73717057ca4a3e947454fafe1ab43a25a

Caveat: you must run ./launcher rebuild app to pick up this change.
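
For completeness, from the host that looks like this (assuming the standard /var/discourse install location):

cd /var/discourse
./launcher rebuild app   # rebuilds the container so the fix is picked up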